Abstract

In this paper, we consider the general scenario of resource sharing in a decentralized system when
the resource rewards/qualities are time-varying and unknown to the users, and simultaneous use of the
same resource by multiple users reduces the quality each of them receives. First, we consider a
user-independent reward model with no communication between the users, where a user gets feedback about
the congestion level in the resource it uses. Second, we consider user-specific rewards and allow costly
communication between the users. The users have a cooperative goal of achieving the highest system
utility. There are multiple obstacles in achieving this goal, such as the decentralized nature of the
system, unknown resource qualities, and communication, computation and switching costs. We propose
distributed learning algorithms with logarithmic regret with respect to the optimal allocation. Our
logarithmic regret result holds under both i.i.d. and Markovian reward models, as well as under
communication, computation and switching costs.

I. Introduction

In this paper, we consider the general multiuser online learning problem in a resource sharing setting,
where the reward of a user from a resource depends on how many others are using that resource. We use
the mathematical framework of multi-armed bandit problems, where a resource corresponds to an arm
that, at discrete time steps, generates random rewards depending on the number of users using it.
The goal of the users is to achieve a system objective, such as maximization of the expected total
utility over time.

One of the major challenges in a decentralized system is the asymmetric information between the 
users. In cases where communication between the users is not allowed, each of them should act based 
on its own history of observations and actions. In such a case, without feedback about the choices of 
other users, it will be impossible for all users to coordinate on an arbitrary system objective. Therefore, 
we introduce minimal feedback about the actions of the other users. Specifically, at each time step, this
feedback tells each user how many users are using the same resource as it is. When such feedback is
available, and the rewards from resources are user-independent, we show that the system utility, i.e.,
the sum of the rewards of all users, can be made logarithmically close in time to the utility of the best
static allocation of resources. User-independent rewards are a necessary element of this problem: since
communication is not allowed between users, a user can only assess the quality of a resource for
another user based on the quality it perceives itself.

Another model we consider allows the users to communicate with each other and share their past 
observations but with a cost of communication. We consider general user-specific random resource rewards 
without the feedback setting discussed above. We propose distributed online learning algorithms that
achieve logarithmic regret when a lower bound on the performance gap of the total utility function, i.e.,
the difference between the total rewards of the best and the second-best allocations, is known. If the
performance gap is not known, we propose a way to achieve near-logarithmic regret.

In addition to the aforementioned models, our algorithms achieve the same order of regret even when 
we introduce computation and switching costs, where the computation cost models the time and resources 
it takes for the users to compute the estimated optimal allocation, while switching cost models the cost 
of changing the resource. 

One of our motivating applications is opportunistic spectrum access (OSA) in a cognitive radio network.
Each user in this setting corresponds to a transmitter-receiver pair, and each resource corresponds to a 
channel over which data transmission takes place. We assume a slotted time model where channel sensing 
is done at the beginning of a time slot, data is transmitted during a slot, and transmission feedback is 
received at the end of a slot. The quality of a channel changes dynamically at the end of each slot
according to some unknown stochastic process. Each user selects a channel at each time step, and receives
a reward depending on the channel quality during that time slot. Since the channel quality is acquired
through channel sensing, the user can only observe the quality of the channel it sensed in the current time
slot. The channel selection problem of a user can be cast as a decision-making problem under uncertainty
about the distribution of the channel rewards, and uncertainty due to partially observed channels.

The regret (sometimes called weak regret) of an algorithm with respect to the best static allocation rule
at time T is the difference between the total expected reward of the best static allocation rule by time T
and the total expected reward of the algorithm by time T. Note that a static allocation rule is an offline
rule in which a fixed allocation is selected at each time step, since the feedback received from the
observations is not taken into account. The regret quantifies the rate of convergence to the best static
allocation. As T goes to infinity, the average reward of any algorithm with sub-linear regret converges to
that of the optimal static allocation, and the convergence is faster for an algorithm with smaller regret.
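
As a rough illustration of this definition, the following Python sketch computes the cumulative weak regret of a sequence of per-step expected system rewards against a fixed benchmark value. The function name `weak_regret` and the numbers in the example are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def weak_regret(best_static_value, expected_rewards):
    """Return the cumulative weak regret after each time step.

    best_static_value: expected per-step reward of the best static allocation.
    expected_rewards:  expected per-step system reward obtained by the algorithm.
    """
    regrets = []
    collected = 0.0
    for t, reward in enumerate(expected_rewards, start=1):
        collected += reward
        regrets.append(t * best_static_value - collected)
    return regrets


# Hypothetical run: the algorithm is suboptimal only in the first step.
regrets = weak_regret(1.0, [0.5, 1.0, 1.0, 1.0])
average_gap = regrets[-1] / len(regrets)  # regret per step after T = 4 steps
```

In this toy run the regret stays constant after the first step, so the per-step gap `regrets[-1] / T` shrinks as T grows; any sub-linear regret produces a vanishing per-step gap in the same way.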

The main contribution of this paper is to show that order optimal resource allocation algorithms can be 
designed under both limited feedback and costly communication. As we illustrate in subsequent sections, 
if the users are given limited feedback about the number of simultaneous users on the same resource, 
then they can achieve logarithmic regret with respect to the optimal static allocation. Even when the
users are fully decentralized, if costly communication between the users is possible, they can
still achieve logarithmic regret with respect to the optimal static allocation.

The organization of the rest of the chapter is as follows. We discuss related work in Section II. In
Section III we give the problem formulation. We study the limited feedback model without communication
between the users in Section IV. Then, we consider the user-specific resource reward model with
communication in Section V. We provide an OSA application, and give numerical results in Section
VI. A discussion is given in Section VII, and final remarks are given in Section VIII.

II. Related work 

We can fit our online resource sharing problems in the multi-armed bandit framework. Therefore, the 
related literature mostly consists of multi-armed bandit problems. 

The single-player multi-armed bandit problem is widely studied and well understood. The seminal
work [1] considers the problem where there is a single player that plays one arm at each time step, and 
the reward process for an arm is an IID process whose probability density function (pdf) is unknown 
to the player, but lies in a known parameterized family of pdfs. Under some regularity conditions such 
as the denseness of the parameter space and continuity of the Kullback-Leibler divergence between two 
pdfs in the parameterized family of pdfs, the authors provide an asymptotic lower bound on the regret 
of any uniformly good policy. This lower bound is logarithmic in time which indicates that at least a 
logarithmic number of samples should be taken from each arm to decide on the best arm with a high 
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probability. A policy is defined as asymptotically efficient if it achieves this lower bound, and the authors 
construct such a policy which is also an index policy. 

This result is extended in [2] to a single player with multiple plays. However, the complexity of
deciding on which arm to play is shown to increase linearly in time both in [1] and [2]; this makes
the policy computationally infeasible. This problem is addressed in [3], where sample-mean based index
policies that achieve logarithmic order of regret are constructed. The complexity of a sample-mean based 
policy does not depend on time since the decision at each time step only depends on parameters of the 
preceding time step. The proposed policies are order optimal, i.e., they achieve the logarithmic growth 
of regret in time, though they are not in general optimal with respect to the constant. In all the works 
cited above, the limiting assumption is that there is a known single parameter family of pdfs governing 
the reward processes in which the correct pdf of an arm reward process resides. Such an assumption 
virtually reduces the arm quality estimation problem to a parameter estimation problem. 

This assumption is relaxed in [4], where it is only assumed that the reward of an arm is drawn from
an unknown distribution with a bounded support. An index policy, called the upper confidence bound
(UCB1), is proposed; it is similar to the one in [3] and it achieves logarithmic order of regret uniformly in
time. A modified version of UCB1 with a smaller constant of regret is proposed in [5]. Another work [6]
proposed an index policy, KL-UCB, which is uniformly better than UCB1 in [4]. Moreover, it is shown
to be asymptotically optimal for Bernoulli rewards. The authors of [7] consider the same problem as in [4],
but in addition take into account empirical variance of the arm rewards when deciding which arm to 
select. They provide a logarithmic upper bound on regret with better constants under the condition that 
suboptimal arms have low reward variance. Moreover, they derive probabilistic bounds on the variance 
of the regret by studying its tail distribution. 
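
To make the notion of a sample-mean based index policy concrete, the sketch below implements the UCB1 rule in its commonly stated form for rewards in [0, 1]: play every arm once, then play the arm with the largest value of sample mean plus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j. The Bernoulli arms, the seed, and the helper `pull` are hypothetical choices made for illustration; the constants in the cited works may differ.

```python
import math
import random


def ucb1(pull, num_arms, horizon):
    """Run a UCB1-style index policy; return per-arm play counts and sample means."""
    counts = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # initialization: play every arm once
        else:
            plays_so_far = t - 1
            arm = max(
                range(num_arms),
                key=lambda j: means[j]
                + math.sqrt(2.0 * math.log(plays_so_far) / counts[j]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental sample mean
    return counts, means


rng = random.Random(0)
success_prob = [0.2, 0.8]  # hypothetical Bernoulli arms


def pull(arm):
    return 1.0 if rng.random() < success_prob[arm] else 0.0


counts, means = ucb1(pull, num_arms=2, horizon=2000)
```

Each decision only needs the current counts and sample means, which is why the per-step complexity of such a policy does not grow with time.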

Another part of the literature is concerned with the case where the reward processes are Markovian. 
We refer to this as the Markovian model. This offers a richer framework for the analysis, and is better
suited for many real-world applications including opportunistic spectrum access, the central motivation 
underlying this chapter. The Markovian model can be further divided into two cases. 

The first case is the rested Markovian model, in which the state of an arm evolves according to a 
Markov rule when it is played or activated, but otherwise remains frozen. A usual assumption under this 
model is that the reward process for each arm is modeled as a finite-state, irreducible, aperiodic Markov 
chain. This problem is first addressed in [8] under a parameterized transition model, where asymptotically
efficient index policies are proposed; these achieve logarithmic regret with respect to the optimal policy
with known transition matrices, i.e., the policy that always selects the best arm. Similar to the work [4]
under the IID model, [9] relaxes the parametric assumption, and proves that the index policy in [4] can
achieve logarithmic regret in the case of rested Markovian rewards by using a large deviation bound on
Markov chains from [10].

The second case is the restless Markovian model, in which the state of an arm evolves according to
a Markov rule when it is played, and otherwise evolves according to an arbitrary process. This problem
is significantly harder than the rested case; even when the transition probabilities of the arms are known
a priori, it is PSPACE-hard to approximate the optimal policy [11]. In [12], a regenerative cycle based
algorithm is proposed, which reduces the problem to estimating the mean rewards of the arms by exploiting
the regenerative cycles of the Markov process. A logarithmic regret bound with respect to the
best static policy is proved when the reward from each arm follows a finite-state, irreducible, aperiodic
Markov chain, and when there is a special condition on the multiplicative symmetrization of the transition
probability matrix. A parallel development, [13], uses the idea of geometrically growing exploration
and exploitation block lengths to prove a logarithmic regret bound. Both of the above studies utilize
sample-mean based index policies that require a minimal number of computations. Stronger measures
of performance are studied in [14] and [15]. Specifically, [14] considers an approximately optimal,
computationally efficient algorithm for a special case of the restless bandit problem, called the
feedback bandit problem, studied in [16]. In [16], a computationally efficient algorithm for the feedback
bandit problem with known transition probabilities is developed. This development follows from taking
the Lagrangian of Whittle's LP [17], which relaxes the assumption that one arm is played at each time
step to one arm being played on average. The idea behind the algorithm in [14] is to combine learning and
optimization by using a threshold variant of the optimization policy proposed in [16] on the estimated
transition probabilities in exploitation steps. In [15], an algorithm with logarithmic regret with respect
to the best dynamic policy is given, under a general Markovian model. This algorithm solves the estimated
average reward optimality equation (AROE), and assigns an index to each arm which maximizes the h
function over a ball centered around true transition probabilities. The drawback of this algorithm is that
it is computationally inefficient.

It should be noted that the distinction between rested and restless arms does not arise when the
reward process is IID. Since the rewards are drawn independently each time, whether an
unselected arm remains still or continues to change does not affect the reward this arm produces the next
time it is played, whenever that may be. This is clearly not the case with Markovian rewards. In the rested
case, since the state is frozen when an arm is not played, the state in which we next observe the arm is
independent of how much time elapses before we play the arm again. In the restless case, the state of
an arm continues to evolve, so the state in which we next observe it depends on the
amount of time that elapses between two plays of the same arm. This makes the problem significantly
more difficult.
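
The difference between the two cases can be seen in a small simulation. The sketch below is only an illustration under simplifying assumptions: the class `MarkovArm` is hypothetical, the restless arm follows the same Markov rule when it is not played (one special case of an arbitrary process), and the deterministic two-state flipping chain is periodic and used purely to make the outcome reproducible.

```python
import random


class MarkovArm:
    """Two-state arm whose reward is a fixed value attached to its current state."""

    def __init__(self, transition, state_rewards, restless, rng, state=0):
        self.transition = transition
        self.state_rewards = state_rewards
        self.restless = restless
        self.rng = rng
        self.state = state

    def step(self, played):
        """Advance one time slot; return the reward if the arm was played, else None."""
        reward = self.state_rewards[self.state] if played else None
        # A rested arm is frozen while unplayed; a restless arm keeps evolving.
        if played or self.restless:
            prob_to_zero = self.transition[self.state][0]
            self.state = 0 if self.rng.random() < prob_to_zero else 1
        return reward


flip = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical chain that alternates states
rested = MarkovArm(flip, [0.1, 1.0], restless=False, rng=random.Random(1))
restless = MarkovArm(flip, [0.1, 1.0], restless=True, rng=random.Random(1))

for _ in range(3):  # three slots in which neither arm is played
    rested.step(played=False)
    restless.step(played=False)

rested_reward = rested.step(played=True)      # state is still 0
restless_reward = restless.step(played=True)  # state moved to 1 while unplayed
```

The rested arm returns the reward of the state it was left in, whereas the reward of the restless arm depends on how many slots passed since its last play.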

Unlike in single-player bandit problems, in a multi-player bandit problem decentralization of
information and potential lack of coordination among the players play a crucial role in designing efficient 
algorithms. It is desirable to design distributed algorithms that guide the players to optimal allocations. 
Below we review several results from the multi-player bandit literature. 

Most of the relevant work in decentralized multi-player bandit problems assumes that the optimal 
configuration of players on arms is such that at any time step there is at most one player on an arm. 
We call such a configuration an orthogonal configuration. The works [18] and [19] consider the problem under
the IID model and derive logarithmic upper and lower bounds for the regret assuming that the optimal
configuration is an orthogonal one. Specifically, the algorithm in [18] uses a mechanism called time
division fair sharing, where a player shares the best arms with the others in a predetermined order. By
contrast, the algorithm in [19] uses randomization to settle to an orthogonal configuration, which does
not require predetermined ordering, at the cost of fairness. In the long run, each player settles down to a
different arm, but the initial probability of settling to the best arm is the same for all players. The restless
multi-player Markovian model is considered in [20] and [21], where logarithmic regret algorithms are
proposed. The method in [20] is based on the regenerative cycles developed for a single player
in [12], while the method in [21] is based on deterministic sequencing of exploration and exploitation
blocks, developed for a single player in [13].

Another line of work considers combinatorial bandits on a bipartite graph, in which the goal is to find 
the best bipartite matching of users to arms. These bandits can model resource allocation problems when 
the resource qualities are user-dependent, but resource sharing is not allowed. Specifically, centralized 
multi-user combinatorial bandits are studied in [22] and [23], while a decentralized multi-user combinatorial
bandit is studied in [24]. Due to the special structure of this problem, the optimal matching can be
computed efficiently, whereas computing the optimal allocation is NP-hard in a general combinatorial 
bandit. 

Although the assumption of optimality of an orthogonal configuration is suitable for applications such 
as random access or communication with collision models, it lacks the generality for applications where 
optimal allocation may involve sharing of the same channel; these include dynamic spectrum allocation 
based on techniques like code division multiple access (CDMA) and power control. Our results in this 
paper reduce to the results in [21] when the optimal allocation is orthogonal.

III. Problem formulation and preliminaries

We consider $M$ decentralized users indexed by the set $\mathcal{M} = \{1, 2, \ldots, M\}$, and $K$ resources indexed
by the set $\mathcal{K} = \{1, 2, \ldots, K\}$, in a discrete time setting $t = 1, 2, \ldots$. The quality/reward of a resource
varies stochastically over discrete time steps, and depends on the number of users using the resource. Each
resource $k$ has an internal state $s^k$ which varies over time according to an i.i.d. or Markovian rule. The
quality/reward of a resource depends on the internal state of the resource and the number of users using
the resource. In the i.i.d. model, the state of each resource follows an i.i.d. process with $S = [0, 1]$, and
state distribution function $F$, which is unknown to the users. In the Markovian model, the rewards generated
by resource $k$ follow a Markovian process with a finite state space $S^k$. The state of resource $k$ evolves
according to an irreducible, aperiodic transition probability matrix $P^k$ which is unknown to the users.
The stationary distribution of arm $k$ is denoted by $\pi^k = (\pi^k_x)_{x \in S^k}$.

Let $(P^k)'$ denote the adjoint of $P^k$ on $\ell_2(\pi^k)$, where

$$(p^k)'_{xy} = \frac{\pi^k_y \, p^k_{yx}}{\pi^k_x}, \quad \forall x, y \in S^k,$$

with $p^k_{xy}$ denoting the $(x, y)$ entry of $P^k$, and let $\hat{P}^k = (P^k)' P^k$ denote the multiplicative
symmetrization of $P^k$. Let $\nu^k$ be the eigenvalue gap, i.e., 1 minus the second largest eigenvalue of
$\hat{P}^k$, and $\nu_{\min} = \min_{k \in \mathcal{K}} \nu^k$. We assume that the $P^k$'s are such that the $\hat{P}^k$'s are
irreducible. To give a sense of how weak or strong this assumption is, we note that this
is a weaker condition than assuming the Markov chains to be reversible. This technical assumption is
required in the following large deviation bound that we frequently use in the proofs.
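
The adjoint, the multiplicative symmetrization and the eigenvalue gap can be computed explicitly for a small chain. The following sketch does so for a two-state chain in plain Python; the transition matrix is a hypothetical example, and the closed forms used for the stationary distribution and for the second eigenvalue (trace minus 1 for a 2x2 stochastic matrix) only apply to the two-state case.

```python
def stationary_two_state(P):
    """Stationary distribution of a two-state chain with P[0][1] + P[1][0] > 0."""
    a, b = P[0][1], P[1][0]
    return [b / (a + b), a / (a + b)]


def adjoint(P, pi):
    """Adjoint on l2(pi): entry (x, y) equals pi[y] * P[y][x] / pi[x]."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[x][z] * B[z][y] for z in range(n)) for y in range(n)]
            for x in range(n)]


def eigenvalue_gap_two_state(P):
    """1 minus the second largest eigenvalue of the multiplicative symmetrization."""
    pi = stationary_two_state(P)
    P_hat = matmul(adjoint(P, pi), P)
    # A 2x2 stochastic matrix has eigenvalues 1 and (trace - 1).
    second_eigenvalue = P_hat[0][0] + P_hat[1][1] - 1.0
    return 1.0 - second_eigenvalue


P = [[0.7, 0.3], [0.2, 0.8]]  # hypothetical transition matrix
gap = eigenvalue_gap_two_state(P)
```

Every two-state chain is reversible, so here the adjoint coincides with `P`, the symmetrization is the square of `P`, and the gap is 1 - (1 - 0.3 - 0.2)^2 = 0.75.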

Lemma 1: [Theorem 3.3 from [10]] Consider a finite-state, irreducible Markov chain $\{X_t\}_{t \geq 1}$ with
state space $S$, matrix of transition probabilities $P$, an initial distribution $q$ and stationary distribution
$\pi$. Let $N_q = \left\| \left( q_x / \pi_x, \; x \in S \right) \right\|_2$. Let $\hat{P} = P'P$ be the multiplicative
symmetrization of $P$, where $P'$ is the adjoint of $P$ on $\ell_2(\pi)$. Let $\nu = 1 - \lambda_2$, where $\lambda_2$ is the
second largest eigenvalue of the matrix $\hat{P}$; $\nu$ will be referred to as the eigenvalue gap of $\hat{P}$. Let
$f : S \to \mathbb{R}$ be such that $\sum_{x \in S} \pi_x f(x) = 0$, $\|f\|_\infty \leq 1$ and $0 < \|f\|_2^2 \leq 1$. If $\hat{P}$ is
irreducible, then for any positive integer $T$ and all $0 < \gamma \leq 1$,

$$P\left( \frac{\sum_{t=1}^{T} f(X_t)}{T} \geq \gamma \right) \leq N_q \exp\left[ -\frac{T \gamma^2 \nu}{28} \right].$$
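
To get a feeling for how this bound scales, the sketch below evaluates an expression of the form $N_q \exp(-T \gamma^2 \nu / 28)$, which is the form assumed here for the right-hand side of Lemma 1. The function name `deviation_bound`, its default `n_q=1.0`, and the numerical inputs are hypothetical placeholders.

```python
import math


def deviation_bound(horizon, gamma, eigenvalue_gap, n_q=1.0):
    """Evaluate n_q * exp(-T * gamma^2 * gap / 28) for 0 < gamma <= 1."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return n_q * math.exp(-horizon * gamma ** 2 * eigenvalue_gap / 28.0)


# Hypothetical inputs: T = 2800 samples, deviation 0.1, eigenvalue gap 1.
bound = deviation_bound(horizon=2800, gamma=0.1, eigenvalue_gap=1.0)
```

With these inputs the exponent equals -1, so the bound decays exponentially in the number of samples, at a rate proportional to the eigenvalue gap.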



At each time $t$, user $i$ selects a single resource based on the algorithm $\alpha_i$ it uses. Let $\alpha_i(t)$ be the
resource selected by user $i$ at time $t$ when it uses algorithm $\alpha_i$. Let $\alpha(t) = \{\alpha_1(t), \alpha_2(t), \ldots, \alpha_M(t)\}$ be
the vector of resource selections at time $t$. A resource-usage pair is defined as the tuple $(k, n)$, where $k$
is the index of the resource and $n$ is the number of users using that resource. We consider two different
resource allocation problems in the following sections. In the first model, the resource rewards are
user-independent. A user $i$ selecting resource $k$ at time $t$ gets reward $R_i(t) = r_k(s^k_t, n^k_t)$, where $s^k_t$ is the
state of resource $k$ at time $t$, and $n^k_t$ is the number of users using resource $k$ at time $t$. In this model,
communication between the users is not possible, but limited feedback about the user activity, i.e., $n^k_t$, is
available to each user who selects resource $k$. For example, when the resources are channels, and users
are transmitter-receiver pairs, a user can observe the number of users on the channel it is using by a
threshold or feature detector. Besides the limited feedback, each user also knows the total number of users
in the system. This can be achieved by all users initially broadcasting their IDs. This way, if a user knows
that the resource rewards are user-independent, it can form estimates of the optimal allocation based
on its own observations. In the second model, resource rewards are user-dependent and communication
between the users is possible but incurs some cost. In this model, if user $i$ selects resource $k$ at time $t$,
it gets reward $R_i(t) = r^i_k(s^k_t, n^k_t)$. In this case, the user does not need to observe the number of users
using the same resource as it, since the joint action profile to be selected can be decided by all users,
or it can be decided by some user and announced to other users. The announcement to user $i$ will include
$n^k_t$ for the resource it is assigned, and since the users are cooperative, user $i$ is certain that the number
of users using the same resource as it remains $n^k_t$ until another joint action profile is decided.

In the following sections, we consider distributed learning algorithms to maximize the total reward of 
all users over any time horizon T. The weak regret of a learning algorithm at time T is defined as the 
difference between the total expected reward of the best static strategy of users up to time T, i.e., the 
strategy in which each user selects the same resource at each time step up to T, and the total expected
reward of the distributed algorithm up to time T. Although stronger regret measures, in which a user may
switch resources dynamically in the optimal strategy, are available for the Markovian model, weak
regret is also a commonly used performance measure for this setting (see, e.g., [25]). A stronger regret
measure is considered in [15] for a single-user bandit problem, but the algorithm to find this solution is
computationally inefficient. In this paper we only consider weak regret, and whenever regret is mentioned
it is meant to be weak regret unless otherwise stated.

In the following sections, we will show that in order for the system to achieve the optimal order of 
regret, a user does not need to know the state of the resource it selects. It can learn the optimal strategy 
by only observing the resource rewards.

A. User-independent rewards with limited feedback

In this model, we assume homogeneous users, i.e., if they happen to select the same resource, they
will get the same reward. However, the reward they obtain depends on the interaction among users who
select the same resource. Specifically, when user $i$ selects resource $k$ at time $t$, it gets an instantaneous
reward $r_k(s^k_t, n^k_t)$. Since the users perceive identical resource qualities, we call this model the symmetric
interaction model. No communication is allowed between the users. However, a user receives feedback
about the resource it is using, which is the number of users using that resource. We assume that the
rewards are bounded, and without loss of generality

$$r_k : S^k \times \mathcal{M} \to (0, 1], \quad \forall k \in \mathcal{K}.$$

When the reward process of arm $k$ is i.i.d. with distribution $F$, the mean reward of resource-usage pair
$(k, n)$ is

$$\mu_{k,n} := \int r_k(s, n) \, F(ds).$$

When the reward process of arm $k$ is Markovian with transition probability matrix $P^k$, the mean reward
of resource-usage pair $(k, n)$ is

$$\mu_{k,n} := \sum_{s \in S^k} \pi^k_s \, r_k(s, n).$$

The set of optimal allocations in terms of number of users using each resource is

$$\mathcal{B} := \arg\max_{n \in \mathcal{N}} \sum_{k=1}^{K} n_k \mu_{k,n_k},$$

where $\mathcal{N} = \{n = (n_1, n_2, \ldots, n_K) : n_k \geq 0, \; n_1 + n_2 + \ldots + n_K = M\}$ is the set of possible allocations
of users to resources, with $n_k$ being the number of users using resource $k$. Let

$$v(n) := \sum_{k=1}^{K} n_k \mu_{k,n_k}$$

denote the value of allocation $n$. Then, the value of the optimal allocation is

$$v^* := \max_{n \in \mathcal{N}} v(n).$$

For any allocation $n \in \mathcal{N}$, the suboptimality gap is defined as

$$\Delta(n) := v^* - \sum_{k=1}^{K} n_k \mu_{k,n_k}.$$

Then the minimum suboptimality gap is

$$\Delta_{\min} := \min_{n \in \mathcal{N} \setminus \mathcal{B}} \Delta(n). \tag{1}$$
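
For small systems, the set of optimal allocations, the optimal value and the minimum suboptimality gap can be found by exhaustive enumeration. The sketch below does this in Python; the mean rewards in `mu` are hypothetical and were picked so that the optimal allocation is not orthogonal, i.e., both users share the same resource.

```python
from itertools import product


def allocations(num_users, num_resources):
    """Yield every vector (n_1, ..., n_K) of nonnegative counts summing to M."""
    for n in product(range(num_users + 1), repeat=num_resources):
        if sum(n) == num_users:
            yield n


def value(n, mu):
    """v(n) = sum_k n_k * mu[k][n_k]; unused resources contribute nothing."""
    return sum(n_k * mu[k][n_k] for k, n_k in enumerate(n) if n_k > 0)


# Hypothetical mean rewards mu[k][n] for K = 2 resources and M = 2 users.
mu = [{1: 0.9, 2: 0.7}, {1: 0.3, 2: 0.1}]
values = {n: value(n, mu) for n in allocations(2, 2)}
v_star = max(values.values())
optimal_set = [n for n, v in values.items() if v == v_star]
delta_min = min(v_star - v for n, v in values.items() if n not in optimal_set)
```

In this example the allocation (2, 0) has value 1.4, the orthogonal allocation (1, 1) has value 1.2, and the minimum suboptimality gap is 0.2. The enumeration grows quickly with M and K, so it is only meant as an illustration.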



We have the following assumption on the set of optimal allocations: 

Assumption 1: (Uniqueness) There is a unique optimal allocation in terms of the number of users on
each channel, i.e., the cardinality of $\mathcal{B}$ is $|\mathcal{B}| = 1$.

Let $n^*$ denote the unique optimal allocation when Assumption 1 holds. This assumption guarantees
convergence by random selections over the optimal channels, when each user knows the optimal allocation.
Without the uniqueness assumption, even if all users know the set of optimal allocations, convergence by
simple randomization is not possible. In that case, the users need to bias some of the optimal allocations
in order to converge to one of them. In Section VII, we explain how the uniqueness assumption can be
relaxed. The uniqueness assumption implies the following stability condition.

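Lemma 2: (Stability) When Assumption 1 holds, for a set of estimated mean rewards $\hat{\mu}_{k,n_k}$, if
$|\hat{\mu}_{k,n_k} - \mu_{k,n_k}| < \Delta_{\min}/(2M)$, $\forall k \in \mathcal{K}, n_k \in \mathcal{M}$, then

$$\arg\max_{n \in \mathcal{N}} \sum_{k=1}^{K} n_k \hat{\mu}_{k,n_k} = \mathcal{B}.$$

Proof: Let $\hat{v}(n)$ be the estimated value of allocation $n$ computed using the estimated mean rewards
$\hat{\mu}_{k,n_k}$. Then, $|\hat{\mu}_{k,n_k} - \mu_{k,n_k}| < \Delta_{\min}/(2M)$, $\forall k \in \mathcal{K}, n_k \in \mathcal{M}$ implies that for any
$n \in \mathcal{N}$, we have $|\hat{v}(n) - v(n)| < \Delta_{\min}/2$, since $\sum_{k=1}^{K} n_k = M$. This implies that

$$v^* - \hat{v}(n^*) < \Delta_{\min}/2, \tag{2}$$

and, for any suboptimal allocation $n \in \mathcal{N}$,

$$\hat{v}(n) - v(n) < \Delta_{\min}/2. \tag{3}$$

By (2) and (3), and since $v^* - v(n) \geq \Delta_{\min}$ for any suboptimal $n$,

$$\hat{v}(n^*) - \hat{v}(n) > \Delta_{\min} - 2\Delta_{\min}/2 = 0. \qquad \blacksquare$$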
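
The stability condition of Lemma 2 can be checked numerically on a toy instance. In the sketch below, every estimate deviates from the true mean by slightly less than the tolerance $\Delta_{\min}/(2M)$, in the direction that most favors the suboptimal allocations, and the estimated optimal allocation still coincides with the true one. All numbers are hypothetical; this is an illustration of the statement, not a proof.

```python
from itertools import product


def best_allocations(mu, num_users):
    """Return (set of maximizing allocations, maximal value) by exhaustive search."""
    num_resources = len(mu)
    values = {}
    for n in product(range(num_users + 1), repeat=num_resources):
        if sum(n) == num_users:
            values[n] = sum(c * mu[k][c] for k, c in enumerate(n) if c > 0)
    top = max(values.values())
    return {n for n, v in values.items() if v == top}, top


M = 2
mu = [{1: 0.9, 2: 0.7}, {1: 0.3, 2: 0.1}]  # hypothetical true mean rewards
delta_min = 0.2  # gap between the best (2, 0) and second-best (1, 1) allocation
eps = 0.98 * delta_min / (2 * M)  # strictly inside the tolerance of the lemma

true_best, _ = best_allocations(mu, M)
# Adversarial estimates: push the optimal pair down and every other pair up.
mu_hat = [{1: 0.9 + eps, 2: 0.7 - eps}, {1: 0.3 + eps, 2: 0.1 + eps}]
estimated_best, _ = best_allocations(mu_hat, M)
```

With these estimates the allocation (2, 0) has estimated value 1.302 and the allocation (1, 1) has estimated value 1.298, so the ordering is preserved, although only barely, which is what the factor 2M in the tolerance is designed for.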
The stability condition guarantees that when a user estimates the mean rewards of resource-usage
pairs accurately enough, it can find the optimal allocation. In this section we study the case when a lower
bound on $\Delta_{\min}$ is known by the users. This assumption may seem too strong since the users do not
know the statistics of the resource rewards. However, if the resource reward represents a discrete quantity
such as data rate in bytes or income from a resource in dollars, then all users will know that $\Delta_{\min} = 1$
byte or dollar. Extension of our results to the case when $\Delta_{\min}$ is not known by the users can be done
by increasing the number of samples that are used to form the estimates $\hat{\mu}_{k,n}$ over time at a specific rate. In
Section VII we investigate this extension in detail.

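The regret of algorithm $\alpha$ by time $T$ is given by

$$R_\alpha(T) = T v^* - E\left[ \sum_{t=1}^{T} \sum_{i=1}^{M} r_{\alpha_i(t)}\big(s^{\alpha_i(t)}_t, n^{\alpha_i(t)}_t\big) \right]. \tag{4}$$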

In order to maximize its reward, a user needs to compute the optimal allocation based on its estimated 
mean rewards. This is a combinatorial optimization problem which is NP-hard. We assume that each time 
a user computes the optimal allocation, a computational cost $C_{cmp}$ is incurred. For example, this cost
can model the time it takes to compute the optimal allocation or the energy consumption of a wireless
node associated with the computation. Although in our model we assume that computation is performed
at the end of a time slot, we can modify the model such that computation takes a finite number of time
slots, and $C_{cmp}$ is the total regret due to a single computation.

Sometimes, even when a user learns that some other resource is better than the resource it is currently 
using, it may be hesitant to switch to the new resource because of the cost incurred by changing the 
resource. For example, for a radio, switching to another frequency band requires resynchronization of 
transmitter and receiver. Let $C_{swc}$ be the cost of changing the currently used resource. We assume that
switching is performed at the end of a time slot, but our results still hold if switching requires multiple 
time slots. 

If we add these costs to the regret, it becomes

$$R_\alpha(T) = T v^* - E\left[ \sum_{t=1}^{T} \sum_{i=1}^{M} r_{\alpha_i(t)}\big(s^{\alpha_i(t)}_t, n^{\alpha_i(t)}_t\big) \right] + C_{cmp} \sum_{i=1}^{M} m^i_{cmp}(T) + C_{swc} \sum_{i=1}^{M} m^i_{swc}(T), \tag{5}$$

where $m^i_{cmp}(T)$ denotes the number of computations done by user $i$ by time $T$, and $m^i_{swc}(T)$ denotes
the number of resource switchings done by user $i$ by time $T$. Then the problem becomes balancing the
loss in the performance, the loss due to the NP-hard computation, and the loss due to resource switching.

Let $O^*$ be the set of resources that are used by at least one user in the optimal allocation, and $O_i(t)$
be the set of resources that are used by at least one user in the estimated optimal allocation of user $i$.
Let $N^i_{k,n}(t)$ be the number of times user $i$ selected resource $k$ and observed $n$ users on it by time $t$, and
$\hat{\mu}^i_{k,n}(t)$ be the sample mean of the rewards collected from resource-usage pair $(k, n)$ by the $t$th play of
that pair by user $i$.
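
The bookkeeping behind a regret that includes computation and switching costs can be sketched as follows. The helper names, the cost values and the short trace are hypothetical, and the per-step system rewards are treated as given expected values rather than being estimated.

```python
def count_switches(selections):
    """Number of time steps at which a user changes the resource it uses."""
    return sum(1 for prev, cur in zip(selections, selections[1:]) if prev != cur)


def regret_with_costs(v_star, system_rewards, computations, switches, c_cmp, c_swc):
    """Performance loss plus total computation and switching costs over all users."""
    horizon = len(system_rewards)
    performance_loss = horizon * v_star - sum(system_rewards)
    return performance_loss + c_cmp * sum(computations) + c_swc * sum(switches)


# Hypothetical trace for M = 2 users over T = 3 steps.
selections = [[1, 0, 0], [0, 0, 0]]                 # resource chosen by each user
switches = [count_switches(s) for s in selections]  # user 0 switches once
total = regret_with_costs(
    v_star=1.4,
    system_rewards=[1.2, 1.4, 1.4],
    computations=[1, 1],
    switches=switches,
    c_cmp=0.5,
    c_swc=0.25,
)
```

Here the performance loss is 0.2, the computation cost is 1.0 and the switching cost is 0.25. The example shows that infrequent computation and switching matter as much as fast learning once these costs are counted.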

B. User-specific rewards with costly communication 

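In this model, resource rewards are user-specific. Users using the same resource at the same time may
receive different rewards. Let $r^i_k(s^k_t, n^k_t)$ be the instantaneous reward user $i$ gets from resource $k$ at time
$t$. The expected reward of resource-usage pair $(k, n)$ is given by

$$\mu^i_{k,n} := \int r^i_k(s, n) \, F(ds),$$

for the i.i.d. model, and by

$$\mu^i_{k,n} := \sum_{s \in S^k} \pi^k_s \, r^i_k(s, n),$$

for the Markovian model. In this case, the set of optimal allocations is

$$\mathcal{A} := \arg\max_{a \in \mathcal{K}^M} \sum_{i=1}^{M} \mu^i_{a_i, n_{a_i}(a)}, \tag{6}$$

where $a_i$ is the resource assigned to user $i$ under allocation $a$, and $n_k(a)$ is the number of users
on resource $k$ under $a$.

Similar to the definitions in Section III-A, $v(a)$ is the value of allocation $a$, $v^*$ is the value of an optimal
allocation, $\Delta(a)$ is the suboptimality gap of allocation $a$ and $\Delta_{\min}$ is the minimum suboptimality gap,
i.e.,

$$\Delta_{\min} := v^* - \max_{a \in \mathcal{K}^M \setminus \mathcal{A}} v(a).$$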
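
With user-specific rewards an allocation has to say which user goes to which resource, so the search space is the set of all $K^M$ assignments. The sketch below enumerates it for a toy instance; the table `mu[i][k][n]` of mean rewards is hypothetical, and exhaustive search is used only because the instance is tiny.

```python
from collections import Counter
from itertools import product


def value(a, mu):
    """v(a) = sum_i mu[i][a_i][n], where n is the number of users on resource a_i."""
    load = Counter(a)
    return sum(mu[i][k][load[k]] for i, k in enumerate(a))


# Hypothetical mean rewards mu[i][k][n] for M = 2 users and K = 2 resources.
mu = [
    [{1: 0.9, 2: 0.5}, {1: 0.2, 2: 0.1}],  # user 0 prefers resource 0
    [{1: 0.4, 2: 0.3}, {1: 0.8, 2: 0.6}],  # user 1 prefers resource 1
]
M, K = 2, 2
values = {a: value(a, mu) for a in product(range(K), repeat=M)}
v_star = max(values.values())
optimal_set = [a for a, v in values.items() if v == v_star]
delta_min = v_star - max(v for a, v in values.items() if a not in optimal_set)
```

In this instance the optimal allocation sends user 0 to resource 0 and user 1 to resource 1, with value 1.7, and the second-best allocation has value 0.8. No single user can compute these values from its own observations, which is why communication is needed in this model.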

In order to estimate the optimal allocation, a user must know how resource qualities are perceived
by other users. Note that the limited feedback discussed in the previous section is not enough in
this case, since it only helps a user to distinguish the rewards of resource-usage pairs of the same
resource. Therefore, in this section we assume that communication between the users is possible. This
can be done either by broadcasting every time communication is needed, or by initially broadcasting the
next time to communicate on a specific channel, and then using time division multiple access on that channel
to transmit information about the resource estimates and the next time step to communicate. Every time
a user communicates with other users, it incurs cost $C_{com}$. Considering the computation cost $C_{cmp}$ of
computing (6), and the switching cost $C_{swc}$, the regret at time $T$ is

$$R_\alpha(T) := T v^* - E\left[ \sum_{t=1}^{T} \sum_{i=1}^{M} r^i_{\alpha_i(t)}\big(s^{\alpha_i(t)}_t, n^{\alpha_i(t)}_t\big) \right] + C_{cmp} \sum_{i=1}^{M} m^i_{cmp}(T) + C_{swc} \sum_{i=1}^{M} m^i_{swc}(T) + C_{com} \sum_{i=1}^{M} m^i_{com}(T), \tag{7}$$

where $m^i_{com}(T)$ is the number of times user $i$ communicated with other users by time $T$.

The following stability condition, which is the analogue of Lemma 2, states that when the estimated
rewards of resource-usage pairs are close to their true rewards, the set of estimated optimal allocations
is equal to the set of optimal allocations.

Lemma 3: (Stability) For a set of estimated mean rewards $\hat{\mu}^i_{k,n}$, if $|\hat{\mu}^i_{k,n} - \mu^i_{k,n}| < \Delta_{\min}/(2M)$,
$\forall i \in \mathcal{M}, k \in \mathcal{K}, n \in \mathcal{M}$, then

$$\arg\max_{a \in \mathcal{K}^M} \sum_{i=1}^{M} \hat{\mu}^i_{a_i, n_{a_i}(a)} = \mathcal{A}.$$

Proof: The proof is similar to the proof of Lemma 2. Note that the value of each allocation is the sum
of the mean rewards of $M$ resource-usage pairs. $\blacksquare$
In the following sections we present an algorithm that achieves logarithmic regret. This algorithm requires that the users know a lower bound on $\Delta_{\min}$. In general such a parameter is unknown, since the users learn the mean resource rewards over time. In Section VII we propose a method to solve this problem, which gives a near-logarithmic regret result.



IV. A DISTRIBUTED SYNCHRONIZED ALGORITHM FOR USER-INDEPENDENT REWARDS 

In this section we propose a distributed algorithm for the users which achieves logarithmic regret with respect to the optimal allocation without communication, by using the partial feedback described in the previous section. We will show that this algorithm achieves logarithmic regret in both the i.i.d. and the Markovian resource models. Our algorithm is called Distributed Learning with Ordered Exploration (DLOE), and its pseudocode is given in Figure 1. DLOE uses the idea of deterministic sequencing of exploration and exploitation with initial synchronization, first proposed in [21], to achieve logarithmic regret for Markovian rewards. A key difference between the problem studied here and that in [21] is that the latter assumes that the optimal allocation of users to resources is orthogonal, i.e., there can be at most one user using a resource in the optimal allocation. For this reason the technical developments in [21] are not applicable to the general symmetric interaction model introduced in this paper.

A. Definition of DLOE 

DLOE operates in blocks. Each block is either an exploration block or an exploitation block. The length of an exploration (exploitation) block increases geometrically in the number of exploration (exploitation) blocks. The parameters that determine the lengths of the blocks are the same for all users. User $i$ has a fixed exploration sequence $\mathcal{N}_i$, which is a sequence of resources to select, of length $N'$. In its $l$th exploration block, user $i$ selects the resources in $\mathcal{N}_i$ sequentially, selecting each resource for $c^{l-1}$ time slots before proceeding to the next resource in the sequence. The $z$th resource in sequence $\mathcal{N}_i$ is denoted by $\mathcal{N}_i(z)$. The set of sequences $\mathcal{N}_1, \ldots, \mathcal{N}_M$ is created so that all users observe all resource-usage pairs at least once, and so that the least observed resource-usage pair for each user is observed only once in a single (parallel) run of the set of sequences by the users. For example, when $M = 2$, $K = 2$, the exploration sequences $\mathcal{N}_1 = \{1,1,2,2\}$ and $\mathcal{N}_2 = \{1,2,1,2\}$ are sufficient for each user to sample every resource-usage pair once. It is always possible to find such a set of sequences. With $M$ users and $K$ resources, there are $K^M$ possible assignments of users to resources. Index the $K^M$ possible assignments as $\{\alpha(1), \alpha(2), \ldots, \alpha(K^M)\}$. Then, using the set of sequences $\mathcal{N}_i = \{\alpha_i(1), \alpha_i(2), \ldots, \alpha_i(K^M)\}$ for $i \in \mathcal{M}$, all users sample all resource-usage pairs at least once.
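The enumeration-based construction described above can be sketched as follows. This is a minimal illustration, assuming resources are labelled $1, \ldots, K$; the function names are hypothetical and not part of DLOE's pseudocode.

```python
from itertools import product


def exploration_sequences(num_users, num_resources):
    """Sequence of user i = i-th coordinate of every one of the K^M assignments."""
    assignments = list(product(range(1, num_resources + 1), repeat=num_users))
    return [[a[i] for a in assignments] for i in range(num_users)]


def observed_pairs(sequences, user):
    """Resource-usage pairs (k, n) that `user` sees in one parallel run of the sequences."""
    pairs = set()
    for slot in range(len(sequences[0])):
        k = sequences[user][slot]
        n = sum(1 for seq in sequences if seq[slot] == k)  # users sharing resource k
        pairs.add((k, n))
    return pairs


seqs = exploration_sequences(num_users=2, num_resources=2)
pairs_user0 = observed_pairs(seqs, user=0)
```

For $M = 2$, $K = 2$ this reproduces the sequences $\{1,1,2,2\}$ and $\{1,2,1,2\}$ of the example, and each user observes all four resource-usage pairs.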

The sequence $\mathcal{N}_i$ is assumed known to user $i$ before the resource selection process starts. Let $l_O^i(t)$ be the number of completed exploration blocks and $l_I^i(t)$ be the number of completed exploitation blocks of user $i$ by time $t$. For user $i$, the length of the $l$th exploration block is $N' c^{l-1}$ and the length of the $l$th exploitation block is $a b^{l-1}$, where $a, b, c$ are positive integers greater than 1.

At the beginning of each block, user $i$ computes $N_O^i(t) := \sum_{l=1}^{l_O^i(t)} c^{l-1}$. If $N_O^i(t) \ge L \log t$, user $i$ starts an exploitation block at time $t$. Otherwise, it starts an exploration block. Here $L$ is the exploration constant, which controls the number of explorations. Clearly, the number of exploration steps up to $t$ is non-decreasing in $L$. Since the estimates of the mean rewards of resource-usage pairs are sample means of the observations made during exploration steps, by increasing the value of $L$ a user can control the probability that the estimated mean rewards deviate from the true mean rewards. Intuitively, $L$ should be chosen according to $\Delta_{\min}$, since the accuracy of the estimated value of an allocation depends on the accuracy of the estimated mean rewards.

Because of the deterministic nature of the blocks and the property of the sequences $\mathcal{N}_1, \ldots, \mathcal{N}_M$ discussed above, if at time $t$ a user starts a new exploration (exploitation) block, then all users start a new exploration (exploitation) block. Therefore $l_O^i(t) = l_O^j(t)$, $N_O^i(t) = N_O^j(t)$, $l_I^i(t) = l_I^j(t)$, for all $i, j \in \mathcal{M}$. Since these quantities are equal for all users, we drop the superscripts and denote them by $l_O(t)$, $N_O(t)$, $l_I(t)$. Let $t_l$ be the time at the beginning of the $l$th exploitation block. At time $t_l$, $l = 1, 2, \ldots$, user $i$ computes an estimated optimal allocation $\hat n^i(l) = \{\hat n^i_1(l), \ldots, \hat n^i_K(l)\}$ based on the sample mean estimates of the resource-usage pairs, given by

$$\hat n^i(l) = \arg\max_{n} \sum_{k=1}^{K} n_k\, \hat\mu^i_{k,n_k}(N^i_{k,n_k}(t_l)),$$

and chooses a resource from the set $O_i(l)$, which is the set of resources selected by at least one user in $\hat n^i(l)$. During the $l$th exploitation block, in order to settle to an optimal configuration, if the number of users on the resource $\alpha_i(t)$ that user $i$ selects is greater than $\hat n^i_{\alpha_i(t)}(l)$, then in the next time step user $i$ randomly chooses a resource within $O_i(l)$. The probability that user $i$ chooses resource $k \in O_i(l)$ is $\hat n^i_k(l)/M$. Therefore, a user is more likely to select a resource which it believes is used by a large number of users in the optimal allocation than a resource which it believes is used by a smaller number of users. This kind of randomization guarantees that the users settle to the optimal allocation in finite expected time, given that the estimated optimal allocation at the beginning of the exploitation block is equal to the optimal allocation.
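The settling dynamics within an exploitation block can be simulated with the short sketch below. It is an illustration under stated assumptions rather than DLOE's pseudocode: all users are assumed to hold the same estimated allocation `target` (a hypothetical dict mapping each resource in the set of used resources to its intended number of users), every user draws its initial resource from the same distribution, and a user re-draws with probability proportional to the target count whenever it observes more users than the target on its resource.

```python
import random


def settle(target, num_users, rng, max_steps=10_000):
    """Simulate randomization until the user counts match `target`.

    Returns (number of slots until settled, final choices), or (None, choices)
    if the allocation did not settle within max_steps.
    """
    resources = list(target)
    weights = [target[k] for k in resources]  # proportional to target counts
    choices = rng.choices(resources, weights=weights, k=num_users)
    for step in range(max_steps):
        counts = {k: choices.count(k) for k in resources}
        if counts == target:
            return step, choices
        # Users on over-crowded resources re-draw; all others stay put.
        choices = [
            k if counts[k] <= target[k] else rng.choices(resources, weights=weights)[0]
            for k in choices
        ]
    return None, choices


steps, final = settle({1: 2, 2: 1}, num_users=3, rng=random.Random(0))
```

Once the counts match the target, no user observes an over-crowded resource, so the configuration is absorbing, which mirrors the settling argument in the text.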

B. Analysis of the regret of DLOE 

Next, we analyze the regret of DLOE. There are three factors contributing to the regret in both i.i.d. 
and the Markovian models. The first one is the regret due to exploration blocks, the second one is the 
regret due to incorrect computation of the optimal allocation by a user, and the third one is the regret due 
to the randomization before settling to the optimal allocation given that all users computed the optimal 
allocation correctly. In addition to those terms contributing to the regret, another contribution to the 
regret in the Markovian model comes from the transient effect of a resource-usage pair not being in its 
stationary distribution when chosen by a user. This arises because we compare the performance of DLOE 
in which users dynamically change their selections with the mean reward of the optimal static allocation. 

In the following lemmas, we bound the parts of the regret that are common to both the i.i.d. and the Markovian models. Bounds on the parts of the regret that depend on the resource model are given in the next two subsections.

If a user starts an exploration block at time $t$, then

$$\sum_{l=1}^{l_O(t)} c^{l-1} < L \log t \;\Rightarrow\; \frac{c^{l_O(t)} - 1}{c - 1} < L \log t \;\Rightarrow\; c^{l_O(t)} < (c-1) L \log t + 1 \;\Rightarrow\; l_O(t) < \log_c\left((c-1) L \log t\right) + 1. \qquad (8)$$

Let $T_O(t)$ be the time spent in exploration blocks by time $t$. By (8) we have

$$T_O(t) \le \sum_{l=1}^{l_O(t)} N' c^{l-1} = N' \frac{c^{l_O(t)} - 1}{c - 1} \le N' L \log t. \qquad (9)$$
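The block structure behind these bounds can be made concrete with a small simulation of the deterministic schedule. The sketch below is illustrative only: the function name and default parameters are hypothetical, time is assumed to start at $t = 1$, and the first block is forced to be an exploration block, in line with the initial exploration of $N'$ slots mentioned in the text.

```python
import math


def block_schedule(horizon, L, seq_len, a=2, b=2, c=2):
    """Return a list of (kind, start_time, length) blocks up to `horizon`.

    seq_len plays the role of N', the length of the exploration sequence.
    """
    t = 1
    explore_blocks = 0
    exploit_blocks = 0
    schedule = []
    while t <= horizon:
        # N_O(t): sum of c^(l-1) over completed exploration blocks.
        n_explore = sum(c ** (l - 1) for l in range(1, explore_blocks + 1))
        if explore_blocks > 0 and n_explore >= L * math.log(t):
            exploit_blocks += 1
            kind, length = "exploit", a * b ** (exploit_blocks - 1)
        else:
            explore_blocks += 1
            kind, length = "explore", seq_len * c ** (explore_blocks - 1)
        schedule.append((kind, t, length))
        t += length
    return schedule


schedule = block_schedule(horizon=1000, L=4.0, seq_len=4)
```

Because every user evaluates the same deterministic rule with the same parameters, all users obtain the same schedule, which is what allows the superscripts to be dropped.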

Lemma 4: For any $t > 0$ which is in an exploitation block, the regret due to explorations by time $t$ is at most

$$M N' L \log t.$$

Proof: Since the rewards are bounded in $(0, 1]$, an upper bound on the worst case is obtained when each user loses a reward of 1 due to suboptimal decisions at each step of an exploration block. The result follows from the bound (9) for $T_O(t)$. ■

By time $t$ at most $t - N'$ slots have been spent in exploitation blocks (because of the initial exploration, the first $N'$ slots are always in an exploration block). Therefore

$$\sum_{l=1}^{l_I(t)} a b^{l-1} = a \frac{b^{l_I(t)} - 1}{b - 1} \le t - N' \;\Rightarrow\; b^{l_I(t)} \le \frac{b-1}{a}(t - N') + 1 \;\Rightarrow\; l_I(t) \le \log_b\left(\frac{b-1}{a}(t - N') + 1\right). \qquad (10)$$

The next lemma bounds the computational cost of solving the NP-hard optimization problem of finding the estimated optimal allocation.

Lemma 5: When users use DLOE, the regret due to the computations by time $t$ is upper bounded by

$$C_{cmp} M \log_b\left(\frac{b-1}{a}(t - N') + 1\right).$$

Proof: The optimal allocation is computed at the beginning of each exploitation block. The number of exploitation blocks is bounded by (10), hence the result follows. ■

C. Analysis of regret for the i.i.d. problem

In this subsection we analyze the regret of DLOE in the i.i.d. model, in which the reward of each resource-usage pair is generated by an i.i.d. process with support in $(0, 1]$. Apart from the bounds on the parts of the regret given in Section IV-B, our next step is to bound the regret caused by incorrect calculation of the optimal allocation by some user, using a Chernoff-Hoeffding bound. Let $\epsilon := \Delta_{\min}/(2M)$ denote the maximum distance between the estimated resource-usage reward and the true resource-usage reward such that Lemma 2 holds.

Lemma 6: Under the i.i.d. model, when each user uses DLOE with constant $L \ge 1/\epsilon^2$, the regret due to incorrect calculations of the optimal allocation at the beginning of the $l$th exploitation block is at most

$$M^3 K (\log(t_l) + 1).$$

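Proof: Let $H(t_l)$ be the event that at the beginning of the $l$th exploitation block, there exists at least one user who computed the optimal allocation incorrectly. Let $\omega$ be a sample path of the stochastic process generated by the learning algorithm and the stochastic resource rewards. The event that user $i$ computes the optimal allocation incorrectly is a subset of the event

$$\{|\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n}| \ge \epsilon \text{ for some } k \in \mathcal{K}, n \in \mathcal{M}\}.$$

Therefore $H(t_l)$ is a subset of the event

$$\{|\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n}| \ge \epsilon \text{ for some } i \in \mathcal{M}, k \in \mathcal{K}, n \in \mathcal{M}\}.$$

Using a union bound, we have

$$I(\omega \in H(t_l)) \le \sum_{i=1}^{M}\sum_{k=1}^{K}\sum_{n=1}^{M} I\left(|\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n}| \ge \epsilon\right).$$

Then taking its expected value, we get

$$P(\omega \in H(t_l)) \le \sum_{i=1}^{M}\sum_{k=1}^{K}\sum_{n=1}^{M} P\left(|\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n}| \ge \epsilon\right). \qquad (11)$$

Since

$$P(|a - b| \ge \epsilon) = 2P(a - b \ge \epsilon),$$

for $a, b > 0$, we have

$$P\left(|\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n}| \ge \epsilon\right) = 2P\left(\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n} \ge \epsilon\right). \qquad (12)$$

Since $t_l$ is the beginning of an exploitation block, we have $N^i_{k,n}(t_l) \ge L \log t_l$. Let $r_{k,n}(t)$ denote the reward of resource-usage pair $(k,n)$ at time $t$, and let $t^i_{k,n}(z)$ denote the time when user $i$ chooses resource $k$ and observes $n$ users on it for the $z$th time. We have

$$\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) = \frac{\sum_{z=1}^{N^i_{k,n}(t_l)} r_{k,n}(t^i_{k,n}(z))}{N^i_{k,n}(t_l)}.$$

Using a Chernoff-Hoeffding bound,

$$P\left(\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n} \ge \epsilon\right) = P\left(\sum_{z=1}^{N^i_{k,n}(t_l)} r_{k,n}(t^i_{k,n}(z)) \ge N^i_{k,n}(t_l)\mu_{k,n} + N^i_{k,n}(t_l)\epsilon\right) \le e^{-2 N^i_{k,n}(t_l)\epsilon^2} \le e^{-2 L \log t_l\, \epsilon^2} \le \frac{1}{t_l^2}, \qquad (13)$$

where the last inequality follows from the fact that $L \ge 1/\epsilon^2$. Substituting (12) and (13) into (11), we have

$$P(\omega \in H(t_l)) \le M^2 K \frac{2}{t_l^2}. \qquad (14)$$

The regret in the $l$th exploitation block caused by incorrect calculation of the optimal allocation by at least one user is upper bounded by

$$M a b^{l-1} P(\omega \in H(t_l)),$$

since there are $M$ users and the resource rewards are in $(0, 1]$. From (10) we have $a b^{l-1} < t_l$. Therefore the regret caused by incorrect calculation of the optimal allocation by at least one user by time $t_l$ is

$$\sum_{l'=1}^{l} M a b^{l'-1} P(\omega \in H(t_{l'})) \le M^3 K \sum_{l'=1}^{l} a b^{l'-1} \frac{2}{t_{l'}^2} \le M^3 K \sum_{l'=1}^{l} \frac{2}{t_{l'}} \le M^3 K \sum_{t=1}^{t_l} \frac{1}{t} \le M^3 K (\log(t_l) + 1).$$

■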
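The role of the condition $L \ge 1/\epsilon^2$ in the Chernoff-Hoeffding step can be checked numerically. The sketch below is a sanity check under assumed values of $M$ and $\Delta_{\min}$ (both hypothetical); it verifies that, with at least $L \log t$ samples, the one-sided Hoeffding tail $e^{-2N\epsilon^2}$ is no larger than $t^{-2}$.

```python
import math


def exploration_constant(delta_min, num_users):
    """Smallest L allowed by the condition L >= 1 / eps^2, with eps = delta_min / (2M)."""
    eps = delta_min / (2 * num_users)
    return 1.0 / eps ** 2


def hoeffding_tail(num_samples, eps):
    """One-sided Chernoff-Hoeffding bound for rewards supported in an interval of length 1."""
    return math.exp(-2.0 * num_samples * eps ** 2)


M, delta_min = 2, 0.8            # hypothetical problem parameters
eps = delta_min / (2 * M)
L = exploration_constant(delta_min, M)

# For each time t, take the minimum number of samples required before exploiting.
checks = []
for t in (10, 100, 1000):
    n = math.ceil(L * math.log(t))
    checks.append((t, n, hoeffding_tail(n, eps), t ** -2))
```

Each tuple in `checks` pairs the tail bound with $t^{-2}$; a smaller $\Delta_{\min}$ gives a smaller $\epsilon$ and hence a quadratically larger exploration constant.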



The following lemma bounds the expected number of exploitation blocks where some user computed the optimal allocation incorrectly.

Lemma 7: When users use DLOE with $L \ge 1/\epsilon^2$, the expected number of exploitation blocks up to any $t$ in which there exists at least one user who computed the optimal allocation incorrectly is bounded by

$$E\left[\sum_{l=1}^{l_I(t)} I(\omega \in H(t_l))\right] \le \sum_{l=1}^{l_I(t)} P(\omega \in H(t_l)) \le M^2 K \beta,$$

where $\beta = \sum_{t=1}^{\infty} 1/t^2$.

Proof: The proof is similar to the proof of Lemma 6, using the bound (14) for $P(\omega \in H(t_l))$. ■
Finally, we bound the regret due to the randomization before settling to the optimal allocation in 
exploitation slots in which all users have computed the optimal allocation correctly. 

Lemma 8: Denote the number of resources which are selected by at least one user in the optimal allocation by $z^*$. Reindex the resources in $O^*$ by $\{1, 2, \ldots, z^*\}$. Let $\mathcal{M} = \{\mathbf{m} : m_1 + m_2 + \ldots + m_{z^*} \le M, m_i \ge 0, \forall i \in \{1, 2, \ldots, z^*\}\}$. In an exploitation block where each user computed the optimal allocation correctly, the regret in that block due to randomizations before settling to the optimal allocation is upper bounded by

$$O_B := \frac{1}{\min_{\mathbf{m} \in \mathcal{M}} P_{DLOE}(\mathbf{m})},$$

where

$$P_{DLOE}(\mathbf{m}) = \frac{(M - m)!}{(n_1 - m_1)! \cdots (n_{z^*} - m_{z^*})!} \left(\frac{n_1}{M}\right)^{n_1 - m_1} \cdots \left(\frac{n_{z^*}}{M}\right)^{n_{z^*} - m_{z^*}}.$$

Proof: Consider user $i$ in an exploitation block $l$ in which it knows the optimal allocation, i.e., $\hat n^i(l) = n^*$. Since user $i$ does not know the selections of other users, knowing the optimal allocation is not enough for the users to jointly select the optimal allocation. At time $t$ in exploitation block $l$, user $i$ selects a resource $k \in O^*$ and then observes the number of users $n_k(t)$ on it. If $n_k(t) \le n^*_k$, it selects the same resource at $t+1$. Otherwise, it selects a resource randomly from $O^*$ at $t+1$. The probability that user $i$ selects resource $k$ is $n^*_k/M$.

Note that this randomization is done at every slot by at least one user until the joint allocation of the users corresponds to the optimal allocation or until the exploitation block ends. Next, we show that when the users follow DLOE, at every slot the probability that the joint allocation settles to the optimal allocation is at least $p_W$, which is the probability of settling to the optimal allocation in the worst configuration of users on resources. Note that $n_1 + n_2 + \ldots + n_{z^*} = M$. Consider the case where $m$ of the users do not randomize while the others randomize. Let $\mathbf{m} = (m_1, m_2, \ldots, m_{z^*})$ be the number of users on each resource in $O^*$ who do not randomize. We have $m = m_1 + m_2 + \ldots + m_{z^*}$. The probability of settling to the optimal allocation in a single round of randomizations is

$$P_{DLOE}(\mathbf{m}) = \frac{(M - m)!}{(n_1 - m_1)! \cdots (n_{z^*} - m_{z^*})!} \left(\frac{n_1}{M}\right)^{n_1 - m_1} \cdots \left(\frac{n_{z^*}}{M}\right)^{n_{z^*} - m_{z^*}},$$

where $(M - m)!/((n_1 - m_1)! \cdots (n_{z^*} - m_{z^*})!)$ is the number of allocations of the randomizing users which result in the unique optimal allocation in terms of the number of users using each resource, and

$$\left(\frac{n_1}{M}\right)^{n_1 - m_1} \cdots \left(\frac{n_{z^*}}{M}\right)^{n_{z^*} - m_{z^*}}$$

is the probability that such an allocation happens. Then,

$$p_W = \min_{\mathbf{m} \in \mathcal{M}} P_{DLOE}(\mathbf{m}).$$

Let $n_l$ be the allocation at the beginning of an exploitation block in which all users computed the optimal allocation correctly. Let $p_{n_l}(t)$ be the probability of settling to the optimal allocation in the $t$th round of randomizations in this exploitation block. Then the expected number of steps before settling to the optimal allocation is

$$E[t_{op}] = \sum_{t=1}^{\infty} t\, p_{n_l}(t) \prod_{i=1}^{t-1} (1 - p_{n_l}(i)).$$

Since $p_W \le p_{n_l}(t)$ for all $n_l$ and $t$, we have

$$E[t_{op}] \le \sum_{t=1}^{\infty} t\, p_W (1 - p_W)^{t-1} = \frac{1}{p_W}.$$

■
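For small instances the worst-case hitting time $O_B$ can be evaluated directly from the formula for $P_{DLOE}$. The sketch below is a hedged illustration: the function names are hypothetical, and the minimization is restricted to vectors with $0 \le m_k \le n_k$, an assumption made here so that every factorial in the formula is well defined.

```python
from itertools import product
from math import factorial, prod


def p_dloe(n_star, m):
    """Probability of settling in one round when m[k] users on resource k do not randomize."""
    total_users = sum(n_star)
    randomizing = total_users - sum(m)
    # Multinomial count of assignments of the randomizing users that complete the optimal counts.
    ways = factorial(randomizing)
    for n_k, m_k in zip(n_star, m):
        ways //= factorial(n_k - m_k)
    # Probability of one such assignment, each user picking resource k with probability n_k / M.
    prob = prod((n_k / total_users) ** (n_k - m_k) for n_k, m_k in zip(n_star, m))
    return ways * prob


def worst_case_hitting_time(n_star):
    """O_B = 1 / min over configurations m of P_DLOE(m), with 0 <= m_k <= n_k (assumed)."""
    candidates = product(*(range(n_k + 1) for n_k in n_star))
    return 1.0 / min(p_dloe(n_star, m) for m in candidates)


o_b = worst_case_hitting_time((2, 1))  # M = 3 users, optimal counts (2, 1)
```

In this example the minimum is attained when two users already sit on the first resource and the single remaining user must land on the second one, which happens with probability $1/3$.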



Lemma 9: The regret due to randomization before settling to the optimal allocation is bounded by

$$O_B M^3 K \beta,$$

where

$$\beta = \sum_{t=1}^{\infty} \frac{1}{t^2}.$$

Proof: A good exploitation block is an exploitation block in which all the users computed the optimal allocation correctly. A bad exploitation block is a block in which there exists at least one user who computed the optimal allocation incorrectly. The worst case is when each bad block is followed by a good block. The number of bad blocks is bounded by Lemma 7. After each such transition from a bad block to a good block, the expected loss is at most $O_B$, which is given in Lemma 8. ■
Lemma 10: When users use DLOE, for any $t > 0$ which is the beginning of an exploitation block, the regret due to switchings by time $t$ is upper bounded by

$$C_{swc} M \left(N' L \log t + O_B \log_b\left(\frac{b-1}{a}(t - N') + 1\right) + M^2 K (\log t + 1)\right).$$

Proof: By Lemma 4, the time spent in exploration blocks by $t$ is bounded by $N' L \log t$. Since resource rewards are always in $(0, 1]$, at most $N' L \log t$ expected regret per user can result from explorations. The number of exploitation blocks is bounded by $\log_b\left(\frac{b-1}{a}(t - N') + 1\right)$. If an exploitation block is a good block as defined in Lemma 9, users will settle to the optimal allocation in $O_B$ expected time slots. Hence, the regret due to switchings in good exploitation blocks cannot be larger than $C_{swc} M O_B \log_b\left(\frac{b-1}{a}(t - N') + 1\right)$. If an exploitation block is a bad block, by Lemma 6 the expected number of slots spent in such exploitation blocks is upper bounded by $M^2 K (\log(t_l) + 1)$. Assuming that in the worst case all users switch at every slot in a bad block, the regret due to switching in bad exploitation blocks is upper bounded by $C_{swc} M^3 K (\log(t_l) + 1)$. ■
Combining all the results above we have the following theorem. 

Theorem 1: If all users use DLOE with $L \ge 1/\epsilon^2$, at the beginning of the $l$th exploitation block, the regret defined in (4) is upper bounded by

$$(M N' L + M^3 K) \log(t_l) + M^3 K (\beta O_B + 1),$$

and the regret defined in (5) is upper bounded by

$$M^3 K (\log t_l + 1)(1 + C_{swc}) + M N' L \log t_l\, (1 + C_{swc}) + M \log_b\left(\frac{b-1}{a}(t_l - N') + 1\right)(C_{swc} O_B + C_{cmp}) + O_B M^3 K \beta,$$

where $O_B$, given in Lemma 8, is the worst case expected hitting time of the optimal allocation given that all users know the optimal allocation, and $\beta = \sum_{t=1}^{\infty} 1/t^2$.

Proof: The result follows from summing the regret terms from Lemmas 4, 6, 9, 5 and 10. ■



D. Analysis of regret for the Markovian problem 

In this subsection we analyze the regret of DLOE in the case of Markovian rewards. The analysis in this section is quite different from the analysis in Section IV-C due to the Markovian rewards. Apart from the bounds on the parts of the regret given in Section IV-B, our next step is to bound the regret caused by incorrect calculation of the optimal allocation by some user. Although the proof of the following lemma is very similar to the proof of Lemma 6, due to the Markovian nature of the rewards we need to bound a large deviation probability from the sample mean of each resource-usage pair for multiple contiguous segments of observations from that resource-usage pair. For simplicity of analysis, we assume that DLOE is run with parameters $a = 2$, $b = 4$, $c = 4$ for the Markovian rewards. A similar analysis can be done for other, arbitrary parameter values. Let $\epsilon := \Delta_{\min}/(2M)$.

Lemma 11: Under the Markovian model, when each user uses DLOE with constant

$$L \ge \max\left\{\frac{1}{\epsilon^2}, \frac{50 S_{\max}^2 r_{\Sigma,\max}^2}{(3 - 2\sqrt{2})\upsilon_{\min}}\right\},$$

the regret due to incorrect calculations of the optimal allocation at the beginning of the $l$th exploitation block is at most

$$M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} (\log(t_l) + 1).$$

Proof: Similar to the analysis for the i.i.d. rewards, let $H(t_l)$ be the event that at the beginning of the $l$th exploitation block, there exists at least one user who computed the optimal allocation incorrectly, and let $\omega$ be a sample path of the stochastic process generated by the learning algorithm and the stochastic rewards. Proceeding the same way as in the proof of Lemma 6, by (11) and (12) we have

$$P(\omega \in H(t_l)) \le \sum_{i=1}^{M}\sum_{k=1}^{K}\sum_{n=1}^{M} 2P\left(\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n} \ge \epsilon\right). \qquad (15)$$

Since $t_l$ is the beginning of an exploitation block, we have $N^i_{k,n}(t_l) \ge L \log t_l$, $\forall i \in \mathcal{M}, k \in \mathcal{K}, n \in \mathcal{M}$. This implies that $N^i_{k,n}(t_l) \ge \sqrt{N^i_{k,n}(t_l) L \log t_l}$. Hence

$$P\left(\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - \mu_{k,n} \ge \epsilon\right) = P\left(N^i_{k,n}(t_l)\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - N^i_{k,n}(t_l)\mu_{k,n} \ge \epsilon N^i_{k,n}(t_l)\right)$$

$$\le P\left(N^i_{k,n}(t_l)\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - N^i_{k,n}(t_l)\mu_{k,n} \ge \epsilon\sqrt{N^i_{k,n}(t_l) L \log t_l}\right). \qquad (16)$$

To bound (16), we proceed the same way as in the proof of Theorem 1 in [13]. The idea is to separate the total number of observations of the resource-usage pair $(k,n)$ by user $i$ into multiple contiguous segments. Then, using a union bound, (16) is upper bounded by the sum of the deviation probabilities for each segment. By a suitable choice of the exploration constant $L$, the deviation probability in each segment is bounded by a negative power of $t_l$. Combining this with the fact that the number of such segments is logarithmic in time (due to the geometrically increasing block lengths), for block length parameters $a = 2$, $b = 4$, $c = 4$ in DLOE, and for

$$L \ge \max\left\{\frac{1}{\epsilon^2}, \frac{50 S_{\max}^2 r_{\Sigma,\max}^2}{(3 - 2\sqrt{2})\upsilon_{\min}}\right\},$$

we have

$$P\left(N^i_{k,n}(t_l)\hat\mu^i_{k,n}(N^i_{k,n}(t_l)) - N^i_{k,n}(t_l)\mu_{k,n} \ge \epsilon\sqrt{N^i_{k,n}(t_l) L \log t_l}\right) \le \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} t_l^{-2}.$$

Continuing from (15), we get

$$P(\omega \in H(t_l)) \le 2 M^2 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} t_l^{-2}. \qquad (17)$$

The result is obtained by continuing the same way as in the proof of Lemma 6. ■
The following lemma bounds the expected number of exploitation blocks where some user computed the optimal allocation incorrectly.

Lemma 12: Under the Markovian model, when each user uses DLOE with constant

$$L \ge \max\left\{\frac{1}{\epsilon^2}, \frac{50 S_{\max}^2 r_{\Sigma,\max}^2}{(3 - 2\sqrt{2})\upsilon_{\min}}\right\},$$

the expected number of exploitation blocks up to any $t$ in which there exists at least one user who computed the optimal allocation incorrectly is bounded by

$$E\left[\sum_{l=1}^{l_I(t)} I(\omega \in H(t_l))\right] \le M^2 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} \beta,$$

where $\beta = \sum_{t=1}^{\infty} 1/t^2$.

Proof: The proof is similar to the proof of Lemma 11, using the bound (17) for $P(\omega \in H(t_l))$. ■
Next, we bound the regret due to the randomization before settling to the optimal allocation in 
exploitation slots in which all users have computed the optimal allocation correctly. 

Lemma 13: The regret due to randomization before settling to the optimal allocation is bounded by

$$(O_B + C_P) M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} \beta,$$

where $O_B$, given in Lemma 8, is the worst case expected hitting time of the optimal allocation given that all users know the optimal allocation, $\beta = \sum_{t=1}^{\infty} 1/t^2$, and $C_P = \max_{k \in \mathcal{K}} C_{P^k}$, where $C_P$ is a constant that depends on the transition probability matrix $P$.

Proof: A good exploitation block is an exploitation block in which all the users computed the optimal allocation correctly. A bad exploitation block is a block in which there exists at least one user who computed the optimal allocation incorrectly. By converting the problem into a simple balls in bins problem where the balls are users and the bins are resources, the expected number of time slots spent before settling to the optimal allocation in a good exploitation block is bounded above by $O_B$. The worst case is when each bad block is followed by a good block, and the number of bad blocks is bounded by Lemma 12. Moreover, due to the transient effect that a resource may not be at its stationary distribution when it is selected, even after settling to the optimal allocation in an exploitation block, a regret of at most $C_P$ can be accrued by a user. This is because the difference between the $t$-horizon expected reward of an irreducible, aperiodic Markov chain with an arbitrary initial distribution and $t$ times the expected reward at the stationary distribution is bounded by $C_P$, independent of $t$. Since there are $M$ users and resource rewards are in $[0, 1]$, the result follows. ■

Similar to the i.i.d. case, the next lemma bounds the regret due to switchings in the Markovian case. 

Lemma 14: When users use DLOE, for any $t > 0$ which is the beginning of an exploitation block, the regret due to the switchings by time $t$ is upper bounded by

$$C_{swc} M \left(N' L \log t + O_B \log_b\left(\frac{b-1}{a}(t - N') + 1\right) + M^2 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} (\log t + 1)\right).$$

Proof: This proof is similar to the proof of Lemma 10 for the i.i.d. model. By Lemma 4, the time spent in exploration blocks by time $t$ is bounded by $N' L \log t$. Since rewards are always in $[0, 1]$, at most $N' L \log t$ expected regret per user can result from explorations. The number of exploitation blocks is bounded by $\log_b\left(\frac{b-1}{a}(t - N') + 1\right)$. If an exploitation block is a good block as defined in Lemma 13, users will settle to the optimal allocation in $O_B$ expected steps. Hence, the regret due to switchings in good exploitation blocks cannot be larger than $C_{swc} M O_B \log_b\left(\frac{b-1}{a}(t - N') + 1\right)$. If an exploitation block is a bad block, by Lemma 11 the expected number of slots spent in such exploitation blocks is upper bounded by

$$M^2 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} (\log(t_l) + 1).$$

Assuming that in the worst case all users switch at every slot in a bad block, the regret due to switchings in bad exploitation blocks is upper bounded by

$$C_{swc} M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} (\log(t_l) + 1).$$

■



Combining all the results above we have the following theorem. 

Theorem 2: Under the Markovian model, when each user uses DLOE with constant

$$L \ge \max\left\{\frac{1}{\epsilon^2}, \frac{50 S_{\max}^2 r_{\Sigma,\max}^2}{(3 - 2\sqrt{2})\upsilon_{\min}}\right\},$$

then at the beginning of the $l$th exploitation block, the regret defined in (4) is upper bounded by

$$\left(M N' L + M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}}\right) \log(t_l) + M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} \left(\beta(O_B + C_P) + 1\right),$$

and the regret defined in (5) is upper bounded by

$$\left(M N' L + M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}}\right)(1 + C_{swc})(\log t_l + 1) + M (C_{cmp} + C_{swc} O_B) \log_4\left(\frac{3}{2}(t_l - N') + 1\right) + (O_B + C_P) M^3 K \left(\frac{1}{\log 2} + \frac{\sqrt{2L}}{10 r_{\Sigma,\min}}\right) \frac{S_{\max}}{\pi_{\min}} \beta,$$

where $O_B$, given in Lemma 8, is the worst case expected hitting time of the optimal allocation given that all users know the optimal allocation, $\beta = \sum_{t=1}^{\infty} 1/t^2$, and $C_P = \max_{k \in \mathcal{K}} C_{P^k}$, where $C_P$ is a constant that depends on the transition probability matrix $P$.

Proof: The result follows from summing the regret terms from Lemmas 4, 11, 13, 5 and 14, and the fact that $a = 2$, $b = 4$. ■
Our results show that when initial synchronization between users is possible, logarithmic regret, which is the optimal order of regret even in the centralized case, can be achieved. Moreover, the proposed algorithm does not need to know whether the rewards are i.i.d. or Markovian; it achieves logarithmic regret in both cases.

V. A DISTRIBUTED SYNCHRONIZED ALGORITHM FOR USER-SPECIFIC REWARDS

In this section we consider the model where the resource rewards are user-dependent. In contrast to the previous section, where a user can compute the socially optimal allocation based only on its own estimates, with user-dependent rewards each user needs to know the estimated rewards of other users in order to compute the optimal allocation. We assume that users can communicate with each other, but this communication incurs a cost $C_{com}$. For example, in an OSA model, users are transmitter-receiver pairs that can communicate with each other on one of the available channels, even when no common control channel exists. In order for communication to take place, each user can broadcast a request for communication over all available channels. For instance, if users are using an algorithm based on deterministic sequencing of exploration and exploitation, then at the beginning a user can announce the parameters that are used to determine the block lengths. This way, the users can decide on which exploration and exploitation sequences to use, so that all of them can start an exploration block or an exploitation block at the same time. After this initial communication, before each exploitation block, users share their perceived channel qualities with each other, and one of the users, which can be chosen in a round robin fashion, computes the optimal allocation and announces to each user the resource it should select in the optimal allocation. Next, we propose the algorithm distributed learning with communication (DLC) for this model.
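The coordination step before an exploitation block can be sketched as follows. This is a minimal illustration under stated assumptions, not DLC's pseudocode: the shared estimates are held in a hypothetical table `estimates[i][k][n-1]`, the leader of block $l$ is taken to be user $(l \bmod M) + 1$ as in the analysis of the i.i.d. case below, and the leader finds the allocation by exhaustive search.

```python
from itertools import product


def leader(block_index, num_users):
    """Round-robin choice of the user (numbered 1..M) who computes the allocation."""
    return (block_index % num_users) + 1


def announce_allocation(estimates, num_resources):
    """Leader's computation: best allocation under the shared user-specific estimates.

    Returns a dict mapping each user index to the resource it is told to select.
    """
    num_users = len(estimates)
    best, best_value = None, float("-inf")
    for alloc in product(range(num_resources), repeat=num_users):
        value = sum(
            estimates[i][k][alloc.count(k) - 1] for i, k in enumerate(alloc)
        )
        if value > best_value:
            best, best_value = alloc, value
    return dict(enumerate(best)), best_value


# Hypothetical shared estimates for M = 2 users and K = 2 resources.
estimates = [
    [[0.9, 0.3], [0.5, 0.2]],
    [[0.4, 0.1], [0.8, 0.3]],
]
assignment, value = announce_allocation(estimates, num_resources=2)
```

Since each user is told exactly which resource to use, no randomization is needed to settle, unlike in DLOE.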

A. Definition of DLC 

Similar to DLOE, DLC (see Figure 2) consists of geometrically increasing exploration and exploitation blocks. The predetermined exploration order allows each user to observe the reward from each resource-usage pair and to update the sample mean rewards of each resource-usage pair. Note that in this case, since feedback about the number of users on a resource is not assumed, each user should be given the number of users using the same resource with it for each resource in the predetermined exploration order. Similar to the previous section, this predetermined exploration order can be seen as an input to the algorithm from the algorithm designer. On the other hand, since communication between the users is possible, the predetermined exploration order can be determined by a user and then communicated to the other users, or users may collectively reach an agreement on a predetermined exploration order by initial communication. In both cases, the initial communication incurs a constant cost $d$.

Let $\mathcal{N}_i$ be the exploration sequence of user $i$, which is defined the same way as in Section IV-A, and let $\mathcal{C}_i(z)$ be the number of users using the same resource with user $i$ in the $z$th slot of an exploration block. Based on the initialization methods discussed above, both $\mathcal{N}_i$ and $\mathcal{C}_i$ are known by user $i$ at the beginning. Note that, unlike DLOE, DLC only uses the resource-usage pair reward estimates from the exploration blocks. However, if the user who computed the optimal allocation announces the resources assigned to other users as well, then rewards from the exploitation blocks can also be used to update the resource-usage pair reward estimates.

Our regret analysis in the following sections shows that estimates from the exploration blocks alone are sufficient to achieve logarithmic regret.

B. Analysis of the regret of DLC 

In this section we bound the regret terms which are the same for both i.i.d. and Markovian resource rewards.

Lemma 15: For any $t > 0$ which is in an exploitation block, the regret of DLC due to explorations by time $t$ is at most

$$M N' L \log t.$$

Proof: Since DLC uses deterministic sequencing of exploration and exploitation the same way as DLOE, the proof is the same as the proof of Lemma 4, using the bound (9) for $T_O(t)$. ■

Since communication takes place at the beginning of each exploitation block, its cost can be computed the same way as the computation cost is computed for DLOE. Moreover, since resource switching is only done during exploration blocks or at the beginning of a new exploitation block, switching costs can also be computed the same way. The following lemma bounds the communication, computation and switching cost of DLC.

Lemma 16: When users use DLC, at the beginning of the $l$th exploitation block, the regret terms due to communication, computation and switching are upper bounded by

$$d + C_{com} M \log_b\left(\frac{b-1}{a}(t_l - N') + 1\right), \quad C_{cmp} M \log_b\left(\frac{b-1}{a}(t_l - N') + 1\right), \quad C_{swc} M \left(N' L \log t_l + \log_b\left(\frac{b-1}{a}(t_l - N') + 1\right)\right),$$

respectively, where $d$ is the cost of initial communication.

Proof: Communication is done initially and at the beginning of exploitation blocks. Computation is only performed at the beginning of exploitation blocks. Switching is only done in exploration blocks or at the beginning of exploitation blocks. The number of exploitation blocks is bounded by (10), and the number of time slots in exploration blocks is bounded by (9). ■

In the next subsections we analyze the parts of the regret that are different for i.i.d. and Markovian rewards.

C. Analysis of regret for the i.i.d. problem 

In this subsection we analyze the regret of DLC in the i.i.d. model. The analysis is similar with the 



user-independent reward model given in Section IV 



Lemma 17: Under the i.i.d. model, when each agent uses DLC with constant L > 1/e 2 , regret due to 
incorrect calculations of the optimal allocation at the beginning of the Zth exploitation block is at most 

M 3 K(/o 5 (tO + l). 

Proof: Let $H(t_l)$ be the event that at the beginning of the $l$th exploitation block, the estimated 
optimal allocation calculated by user $(l \bmod M) + 1$ is different from the true optimal allocation. Let 
$\omega$ be a sample path of the stochastic process generated by the learning algorithm and the stochastic arm 
rewards. The event that user $(l \bmod M) + 1$ computes the optimal allocation incorrectly is a subset of 
the event 

$\{ |\hat{\mu}^i_{k,n}(N^i_{k,n}(t_l)) - \mu^i_{k,n}| \geq \epsilon \text{ for some } i \in \mathcal{M}, k \in \mathcal{K}, n \in \mathcal{M} \}$. 

The analysis follows from using a union bound, taking the expectation, and then using a Chernoff-Hoeffding 
bound. Basically, it follows from the same bound that is used in Lemma 6. ■ 
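
As a rough sketch of the union-bound step, assume that rewards take values in $[0,1]$ and that every triple $(i,k,n)$ has been sampled at least $L \log t_l$ times with $L \geq 1/\epsilon^2$. The constants below follow from these assumptions only and are not copied from Lemma 6:

```latex
\begin{align*}
P\left( |\hat{\mu}^i_{k,n} - \mu^i_{k,n}| \geq \epsilon \right)
  &\leq 2 \exp\left( -2 \epsilon^2 L \log t_l \right)
   \leq 2 t_l^{-2}, \\
P\left( H(t_l) \right)
  &\leq \sum_{i \in \mathcal{M}} \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{M}} 2 t_l^{-2}
   = 2 M^2 K \, t_l^{-2}.
\end{align*}
```

The first line is the Chernoff-Hoeffding bound for a sample mean of $L \log t_l$ bounded observations, and the second line is the union bound over all $M \cdot K \cdot M$ triples. 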

The following lemma bounds the expected number of exploitation blocks in which the optimal allocation 
is calculated incorrectly. 

Lemma 18: When agents use DLC with $L \geq 1/\epsilon^2$, the expected number of exploitation blocks up to any 
$t$ in which the optimal allocation is calculated incorrectly is bounded by 



E 



1=1 l 



£/(o; €#(*,)) 
.1=1 

where $\beta = \sum_{t=1}^{\infty} 1/t^2$. 

Proof: Please see the proof of Lemma 7. ■ 
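
The constant $\beta$ is finite, which is why summing $t^{-2}$ tail probabilities over time gives a time-independent term. A minimal numerical check, assuming $\beta$ denotes the infinite sum $\sum_{t \geq 1} 1/t^2 = \pi^2/6$ (the function name is illustrative only):

```python
import math


def beta_partial_sum(horizon: int) -> float:
    """Partial sum of 1/t^2 for t = 1, ..., horizon."""
    return sum(1.0 / (t * t) for t in range(1, horizon + 1))


# The partial sums increase towards pi^2 / 6 and never exceed it.
limit = math.pi ** 2 / 6
approx = beta_partial_sum(100_000)
gap = limit - approx
```

The remaining gap after $T$ terms is roughly $1/T$, so the partial sum is already very close to the limit for moderate horizons. 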
Combining all the results above we have the following theorem. 

Theorem 3: If all agents use DLC with $L \geq 1/\epsilon^2$, at the beginning of the $l$th exploitation block, the regret 
is upper bounded by 

(MN'L(1 + C swc ) + M 3 K) logfe) + (C com + C cmp + C swc )M\og b ( ^(*J - AO J + ^ 3 ^- 



Proof: The result follows from combining the results of Lemmas 15, 16, 17, and 18. ■ 



D. Analysis of regret for the Markovian problem 

We next analyze the regret of DLC in the Markovian model. The analysis in this section is similar to 
the ones in Section V-D. We assume that DLC is run with parameters a = 2, b = 4, c = 4. 



Lemma 19: Under the Markovian model, when each user uses DLC with constant 

$L \geq \max\{ 1/\epsilon^2, \; 50 S_{\max}^2 r_{\Sigma,\max}^2 / ((3 - 2\sqrt{2}) \upsilon_{\min}) \}$, 

the regret due to incorrect calculations of the optimal allocation at the beginning of the $l$th exploitation 
block is at most 

M 3 K ( J- + -^L.) + !)• 

\ log 2 10rv min / 7T min 



Proof: The proof follows the proof of Lemma 11. ■ 



The following lemma bounds the expected number of exploitation blocks where some user computed 
the optimal allocation incorrectly. 

Lemma 20: Under the Markovian model, when each user uses DLC with constant 

$L \geq \max\{ 1/\epsilon^2, \; 50 S_{\max}^2 r_{\Sigma,\max}^2 / ((3 - 2\sqrt{2}) \upsilon_{\min}) \}$, 

the expected number of exploitation blocks up to any $t$ in which there exists at least one user who 
computed the optimal allocation incorrectly is bounded by 



E 



where p = £ t =i V* 2 - 



£j(w€ff(t,)) 



l=i 



< M 2 K (— + -^^-) f^p, 
\log2 10r s 

,min / ^"min 



Proof: The proof follows the proof of Lemma 12. ■ 
Combining all the results above we have the following theorem. 
Theorem 4: Under the Markovian model, when each user uses DLC with constant 

$L \geq \max\{ 1/\epsilon^2, \; 50 S_{\max}^2 r_{\Sigma,\max}^2 / ((3 - 2\sqrt{2}) \upsilon_{\min}) \}$, 

then at the beginning of the $l$th exploitation block, the regret is upper bounded by 

(mN'L(1 + C swc ) + M*K [J- + ,S ^] logft) 

y ylog2 10r Ejmin y 7r min J 

+ (C com + C cmp + C swt )M \og 4 (htt - N')) + M 3 K ( + ) ^(/3(C P ) + 1), 

V 2 / yiog2 i0r Ejmin y 7r min 

where $C_P = \max_{k \in \mathcal{K}} C_{P^k}$, and $C_P$ is the constant that depends on the transition probability matrix $P$. 



Proof: The result follows from combining the results of Lemmas 15, 16, 19, and 20, and the fact that 
a = 2, b = 4. Note that due to the transient effect that a resource may not be at its stationary distribution 
when it is selected, even when all users select resources according to the optimal allocation, a deviation 
of at most $C_P$ from the expected total reward of the optimal allocation is possible. Therefore, at most $C_P$ 
regret results from the transient effects in exploitation blocks where the optimal allocation is calculated 
correctly. The last term in the regret is a result of this. ■ 

VI. Numerical Results 

In this section we consider an opportunistic spectrum access problem in a cognitive radio network 
consisting of K = 3 channels and M = 3 users. We model the primary user activity on each channel 
as an i.i.d. Bernoulli process with $\theta_k$ being the probability that there is no primary user on channel $k$. 
At each time step a secondary user senses a channel and transmits with a code division multiple access 
(CDMA) scheme if there is no primary user on that channel. Therefore, when there is no primary user 
present, the problem reduces to a multi-channel CDMA wireless power control problem. If channel $k$ is 
not occupied by a primary user, the rate secondary user $i$ gets can be modeled by (see, e.g., [26]), 



(h k P k \ 

where $h^k_{ji}$ is the channel gain between the transmitter of user $j$ and the receiver of user $i$, $P^k_j$ is the transmit 
power of user $j$ on channel $k$, $N_o$ is the noise power, and $\gamma > 0$ is the spreading gain. 

If a primary user is present on channel $k$, in order not to cause interference, secondary users should not 
transmit on that channel; hence they get zero reward from that channel. We assume that the rate function 
is user-independent, i.e., $h^k_{ii} = h^k, \forall i \in \mathcal{M}$, $h^k_{ji} = \tilde{h}^k, \forall i \neq j \in \mathcal{M}$, and $P^k_i = P^k, \forall i \in \mathcal{M}$. Values of the 
parameters of the rate functions and primary user activity are given in Table I. We assume that $N_o = 1$ 
and $\gamma = 1$. In the optimal allocation under these values, the number of users on channels 1, 2, 3 is 0, 2, 1, 
respectively, which is not an orthogonal allocation. 
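
The non-orthogonal optimum can be illustrated with a small brute-force search. The sketch below is not the paper's code: it assumes a standard SINR-based rate of the form $\theta_k \log(1 + \gamma h^k P^k / (N_o + (n-1)\tilde{h}^k P^k))$ per user (natural logarithm), assumes that the first gain row of Table I is the direct gain and the second is the cross gain, and takes the total rate of all users as the objective. All function and constant names are hypothetical.

```python
from itertools import product
from math import log

# Parameters as listed in Table I (index 0..2 stands for channels 1..3).
THETA = [1 / 8, 1 / 3, 1 / 5]   # probability that no primary user is present
H_DIRECT = [5.0, 10.0, 15.0]    # assumption: direct channel gain
H_CROSS = [1.0, 1.2, 3.0]       # assumption: cross (interference) gain
POWER = [1.0, 1.0, 1.0]
NOISE = 1.0
GAMMA = 1.0
NUM_USERS = 3


def per_user_rate(k: int, n: int) -> float:
    """Assumed mean rate of one of the n users sharing channel k."""
    if n == 0:
        return 0.0
    sinr = H_DIRECT[k] * POWER[k] / (NOISE + (n - 1) * H_CROSS[k] * POWER[k])
    return THETA[k] * log(1.0 + GAMMA * sinr)


def allocation_value(counts) -> float:
    """Total mean rate when counts[k] users share channel k."""
    return sum(n * per_user_rate(k, n) for k, n in enumerate(counts))


def best_allocation():
    """Enumerate all ways to place NUM_USERS users on the channels."""
    candidates = [
        c for c in product(range(NUM_USERS + 1), repeat=len(THETA))
        if sum(c) == NUM_USERS
    ]
    return max(candidates, key=allocation_value)
```

Under these assumptions `best_allocation()` returns `(0, 2, 1)`, i.e., no user on channel 1, two users on channel 2 and one user on channel 3. The numerical values of the allocations depend on the assumed rate function and need not match the values used in the simulations. 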
Each user applies the DLOE algorithm in a decentralized way. In Figure 3 we plot regret/log t of 
DLOE under both definitions of the regret, i.e., without and with the computational cost. Our results are 
averaged over 10 runs of the algorithm. We take $\epsilon$ to be half of the difference between the value of the 
optimal allocation and that of the second best allocation, which gives $\epsilon = 0.0811$. We simulate for two 
values of the exploration constant, $L = 1/\epsilon^2 = 152$ and $L = 4/\epsilon^2 = 608$. The computational cost of 
calculating the estimated optimal allocation is $C_{cmp} = 100$. We observe that there is a constant difference 
between the two plots due to the logarithmic number of computations of the optimal allocation. In both 
plots, we observe an initial linear increase in regret. This is due to the fact that users start exploiting only 
after they have sufficiently many explorations, i.e., $X_O(t) > L \log t$. If the difference between the values of 
the optimal and the second best allocations decreases, a smaller $\epsilon$ should be chosen, which will result in a 
larger $L$ and hence an increase in the number of explorations. This will cause the initial linear region to 
expand, increasing the regret. 
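
The two exploration constants follow directly from $\epsilon = 0.0811$. A minimal sketch of the arithmetic (the helper names are illustrative, and the natural logarithm is an assumption):

```python
import math


def exploration_constant(eps: float, scale: float = 1.0) -> float:
    """Exploration constant L = scale / eps^2."""
    return scale / eps ** 2


def required_explorations(L: float, t: int) -> float:
    """Threshold L * log(t) that must be exceeded before exploiting."""
    return L * math.log(t)


L_small = exploration_constant(0.0811)        # about 152
L_large = exploration_constant(0.0811, 4.0)   # about 608
```

Since the threshold scales linearly in $L$, using $L = 4/\epsilon^2$ instead of $L = 1/\epsilon^2$ quadruples the amount of exploration required before exploitation can start, which is consistent with the longer initial linear region observed for $L = 608$. 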

In Table II we give the percentage of time steps in which the optimal allocation is played. We observe 
that the users settle to the channels which are used in the optimal allocation, by estimating the optimal 
allocation correctly and by randomizing if there are more users on the channel than there should be in the 
estimated optimal allocation. The largest contribution to the regret comes from the initial exploration 
blocks. We see that after the initial exploration blocks the percentage of time steps in which the optimal 
allocation is played increases significantly, up to 90% for L = 152 and up to 60% for L = 608. Although the 
estimates of the mean rewards of channel-activity pairs are more accurate with a larger L, more time is 
spent in exploration, thus the average number of plays of the optimal allocation is smaller. In both cases, 
as time goes to infinity, the percentage of time slots in which the optimal allocation is played approaches 
100%. In Table III we give the percentage of time steps in which each channel is selected by user 1. We see 
that after the initial explorations, user 1 chooses most of the time the channels that are used by at least 
one user in the optimal allocation. The percentage of times channel 1 is selected by user 1 falls to about 
4% for L = 152 and 15% for L = 608 at time $5 \times 10^5$. 

VII. Discussion 

In this section we comment on extensions of our algorithms to more general settings and relaxation 
of some assumptions we introduced in the previous sections. 



A. Unknown sub-optimality gap 

Both DLOE and DLC require that users know a lower bound $\epsilon$ on the difference between 
the estimated and true mean resource rewards for which the estimated and true optimal allocations 
coincide. Knowing this lower bound, DLOE and DLC choose an exploration constant $L \geq 1/\epsilon^2$ so that 
$N' L \log t$ time steps spent in exploration are enough to have reward estimates that are within $\epsilon$ of 
the true rewards with a very high probability. 

However, $\epsilon$ depends on the suboptimality gap $\delta$, which is a function of the true mean resource rewards, 
which are unknown to the users at the beginning. This problem can be solved in the following way. Instead 
of using a constant exploration constant $L$, DLOE and DLC use an increasing exploration constant $L(t)$ 
such that $L(1) = 1$ and $L(t) \to \infty$ as $t \to \infty$. In this way the requirement that $L(t) \geq 1/\epsilon^2$ is satisfied 
after some finite number of time steps, which we denote by $T_0$. In the worst case, $M T_0$ regret will come 
from these time steps where $L(t) < 1/\epsilon^2$. After $T_0$, only a finite (time-independent) regret will result from 
incorrect calculations of the optimal allocation due to the inaccuracy in estimates. Since both DLOE and 
DLC explore only if the least explored resource-congestion pair is explored less than $L(t) \log t$ times, 
regret due to explorations will be bounded by $M N' L(t) \log t$. Since the order of explorations with $L(t)$ is 
greater than with constant $L$, the order of exploitations is less than in the case with constant $L$. Therefore, 
the order of regret due to incorrect calculations of the optimal allocation, switchings in exploitation 
blocks, computation and communication at the beginning of exploitation blocks after $T_0$ is less than the 
corresponding regret terms when $L$ is constant. Only the regret due to switchings in exploration blocks 
increases to $C_{swc} M N' L(t) \log t$. Therefore, instead of having $O(\log t)$ regret, without a lower bound on 
$\epsilon$, the proposed modification achieves $O(L(t) \log t)$ regret. 

channel $k$        1      2      3 
$\theta_k$         1/8    1/3    1/5 
$h^k$              5      10     15 
$\tilde{h}^k$      1      1.2    3 
$P^k$              1      1      1 

TABLE I 

Simulation parameters 

$t$          $10^2$   $10^3$   $10^4$   $10^5$   $5 \times 10^5$ 
L = 152      9        9        8        50       90 
L = 608      9        9        7        10       60 

TABLE II 

Percentage of time the optimal allocation is selected up to t. 

$t$ (L = 152)   $10^2$   $10^3$   $10^4$   $10^5$   $5 \times 10^5$ 
Channel 1       46       44       46       18       4 
Channel 2       27       28       31       41       48 
Channel 3       27       28       23       41       48 

$t$ (L = 608)   $10^2$   $10^3$   $10^4$   $10^5$   $5 \times 10^5$ 
Channel 1       46       44       46       37       15 
Channel 2       27       28       31       37       65 
Channel 3       27       28       23       26       20 

TABLE III 

Percentage of channels selected by user 1 up to t. 
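
To illustrate the role of $T_0$, the sketch below finds the first time step at which an increasing exploration function reaches $1/\epsilon^2$. The choice $L(t) = \sqrt{t}$ is purely hypothetical (it satisfies $L(1) = 1$ and grows without bound); the text does not prescribe a particular $L(t)$, and the function names are illustrative.

```python
import math


def first_time_large_enough(L_of_t, eps: float, t_max: int = 10 ** 15) -> int:
    """Smallest t >= 1 with L_of_t(t) >= 1/eps^2, assuming L_of_t is nondecreasing."""
    target = 1.0 / eps ** 2
    hi = 1
    # Double the candidate until the requirement holds.
    while L_of_t(hi) < target:
        hi *= 2
        if hi > t_max:
            raise ValueError("requirement not met within t_max")
    lo = max(1, hi // 2)
    # Binary search for the smallest t in [lo, hi] meeting the requirement.
    while lo < hi:
        mid = (lo + hi) // 2
        if L_of_t(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical increasing exploration function with L(1) = 1.
T0 = first_time_large_enough(math.sqrt, eps=0.0811)
```

A faster growing $L(t)$ makes $T_0$ smaller but inflates the $O(L(t) \log t)$ regret, so the choice of $L(t)$ trades off the worst-case $M T_0$ term against the long-run exploration cost. 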

B. Multiple optimal allocations 

For user-specific rewards with costly communication, the user who computed the estimated optimal 
allocation announces to other users which resources they should select. Since the users are cooperative 
and follow the rules of the algorithm, even if there are multiple optimal allocations, they will all use 
the allocation communicated in this way. The problem arises when there are multiple optimal allocations 
in the user-independent rewards with limited feedback case. Under DLOE, if there are multiple 
optimal allocations, even if all users correctly find an optimal allocation, they may not pick 
the same optimal allocation since they cannot communicate with each other. To avoid this problem, we 
proposed Assumption 1, which guarantees the uniqueness of the optimal allocation. We now describe a 
modification of DLOE so that this assumption is no longer required. 

Let $\epsilon := \Delta_{\min}/(2M)$, where $\Delta_{\min}$ is the minimum suboptimality gap given in (1). Consider user $i$ 
and an exploitation block $l$. For a resource in the set of estimated optimal resources $O_i(l)$, if at the 
last time slot of that exploitation block user $i$ observes a number of users less than or equal to 
the number of users on that resource in the estimated optimal allocation, then at the beginning of the next 
exploitation block it will increase the total estimated reward of that allocation by $\epsilon/4$. Otherwise, 
it will decrease the total estimated reward of that allocation by $\epsilon/4$. Note that when the estimated 
rewards are accurate enough, i.e., within $\epsilon/2$ of the true resource rewards with a very high probability, 
even with an $\epsilon/4$ subsidy, a suboptimal allocation will not be chosen. Moreover, in this way, among the 
optimal allocations, users will settle to the resources for which the number of users using it is at most the 
number of users using it in the optimal allocation. With this modification, after the mean reward estimates 
are accurate enough, the users will settle to one of the optimal allocations in a finite expected number of 
exploitation blocks. 
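
The subsidy rule can be sketched as follows. This is only an illustration of the $\pm\epsilon/4$ adjustment described above, with hypothetical function and argument names; how the adjusted values are stored and compared across allocations is left open here.

```python
def subsidy_epsilon(delta_min: float, num_users: int) -> float:
    """epsilon := Delta_min / (2 M)."""
    return delta_min / (2 * num_users)


def adjusted_allocation_value(estimated_value: float,
                              observed_users: int,
                              planned_users: int,
                              eps: float) -> float:
    """Apply the eps/4 subsidy or penalty to an allocation's estimated total reward.

    observed_users: congestion seen by the user on its resource in the last
                    slot of the previous exploitation block.
    planned_users:  number of users the estimated optimal allocation places
                    on that resource.
    """
    if observed_users <= planned_users:
        # The allocation looks realizable: favor it in the next block.
        return estimated_value + eps / 4
    # Too many users showed up: make this allocation less attractive.
    return estimated_value - eps / 4
```

Because the adjustment is at most $\epsilon/4$ in magnitude, it can only break ties among allocations whose estimated values are already very close, which is what lets users drift towards a common optimal allocation without communicating. 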

VIII. Conclusion 

In this paper, we proposed distributed online learning algorithms for decentralized multi-user resource 
sharing problems. We analyzed the performance of our algorithms, and proved that they achieve 
logarithmic regret under both i.i.d. and Markovian resource reward models when communication, computation 
and switching costs are present. We presented a numerical analysis of a dynamic spectrum access application, 
which is a resource sharing problem. 
Distributed Learning with Ordered Exploration (DLOE) for user i 
Initialize: t = 1, $l_O = 0$, $l_I = 0$, $\eta = 1$, $X_O = 0$, F = 2, z = 1, len = 2, $\hat{\mu}^i_{k,n} = 0$, $N^i_{k,n} = 0$, 
$\forall k \in \mathcal{K}, n \in \{1, 2, \ldots, M\}$, $a, b, c \in \{2, 3, \ldots\}$. 



while $t \geq 1$ do 

if F = 1 //Exploitation block then 

if < (t _i)(*-l) ><(*-!) then 

Pick tti(t) randomly from Oj with P(aj(i) = fc)n l fc /|Oj|. 
else 

= a«(t - 1) 

end if 

if $\eta$ = len then 

F = 
end if 

else if F = 2 //Exploration block then 

$\alpha_i(t) = \mathcal{N}_i(z)$ 
if $\eta$ = len then 
$\eta$ = 0, ++z 
end if 

if z = $|\mathcal{N}_i|$ + 1 then 
$X_O = X_O$ + len 

F = 
end if 
else 

//IF F = 

if $X_O > L \log t$ then 

//Start an exploitation epoch 

F = 1, + + lj, n = 0, len = a x 

//Compute the estimated optimal allocation 

n l = arg m&x neA f J2k=i Afc,n^(«fc / 0) 

Set Oj to be the set of resource in n l with at least one user, 
else if $X_O < L \log t$ then 

//Start an exploration epoch 

F = 2, ++$l_O$, $\eta$ = 0, len = $c^{l_O - 1}$, z = 1 
end if 
end if 

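Let $n_i(t)$ be the number of users on resource $\alpha_i(t)$ 

++$N^i_{\alpha_i(t), n_i(t)}$ 

$\hat{\mu}^i_{\alpha_i(t), n_i(t)} = \frac{(N^i_{\alpha_i(t), n_i(t)} - 1)\, \hat{\mu}^i_{\alpha_i(t), n_i(t)} + r_{\alpha_i(t), n_i(t)}(t)}{N^i_{\alpha_i(t), n_i(t)}}$ 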



++$\eta$, ++t 

end while 



Fig. 1. Pseudocode of DLOE 
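
The last steps of each iteration maintain a running sample mean for the observed resource-congestion pair. A minimal sketch of that incremental update, with hypothetical names and a dictionary keyed by (resource, congestion) standing in for the estimate and counter arrays:

```python
def update_sample_mean(means: dict, counts: dict, resource: int,
                       congestion: int, reward: float) -> None:
    """Incrementally update the sample mean for a (resource, congestion) pair."""
    key = (resource, congestion)
    counts[key] = counts.get(key, 0) + 1
    n = counts[key]
    # New mean = ((n - 1) * old mean + new reward) / n.
    means[key] = ((n - 1) * means.get(key, 0.0) + reward) / n


means, counts = {}, {}
for r in (1.0, 0.0, 0.5):
    update_sample_mean(means, counts, resource=2, congestion=1, reward=r)
```

The update needs only the previous mean and the counter, so a user does not have to store its reward history. 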
Distributed Learning with Communication (DLC) for user i 
Initialize: t = 1, $l_O = 0$, $l_I = 0$, $\eta = 1$, $X_O = 0$, F = 2, z = 1, len = 2, $\hat{\mu}^i_{k,n} = 0$, $N^i_{k,n} = 0$, 
$\forall k \in \mathcal{K}, n \in \{1, 2, \ldots, M\}$, $a, b, c \in \{2, 3, \ldots\}$. 
while $t \geq 1$ do 

if F = 1 //Exploitation block then 
Select resource $\alpha_i(t) = \alpha^*_i$ 
if $\eta$ = len then 

F = 
end if 

else if F = 2 //Exploration block then 

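++$N^i_{\alpha_i(t), n_i(t)}$ 

$\hat{\mu}^i_{\alpha_i(t), n_i(t)} = \frac{(N^i_{\alpha_i(t), n_i(t)} - 1)\, \hat{\mu}^i_{\alpha_i(t), n_i(t)} + r^i_{\alpha_i(t), n_i(t)}(t)}{N^i_{\alpha_i(t), n_i(t)}}$ 

if $\eta$ = len then 
$\eta$ = 0, ++z 
end if 

if z = $|\mathcal{N}_i|$ + 1 then 
$X_O = X_O$ + len 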
F = 
end if 
else 

//IF F = 

if $X_O > L \log t$ then 

//Start an exploitation epoch 

F = 1, + + lj, r) = 0, len = a x 

//Communicate estimated channel qualities $\hat{\mu}^i_{k,n}$, $k \in \mathcal{K}$, $n \in \mathcal{M}$ with other users. 
if $(l_I \bmod M) + 1 = i$ then 

//Compute the estimated optimal allocation, and send each other user the resource it 
//should use in the exploitation block. 

a* = argmax a6K M Y,Zi Aa () „ ai («) 
else 

//Receive the resource $\alpha^*_i$ that will be used in the exploitation block from user 
//$(l_I \bmod M) + 1$. 
end if 

else if $X_O < L \log t$ then 

//Start an exploration epoch 

F = 2, ++$l_O$, $\eta$ = 0, len = $c^{l_O - 1}$, z = 1 
end if 
end if 

++$\eta$, ++t 

end while 



Fig. 2. Pseudocode of DLC 

Fig. 3. Regret/log t for DLOE with and without computational cost. 



